What is its goal? Something that lets people reimplement a test that another system already has, and end up with a test that produces the same output.
What is a “specification”? Something akin to RFCs: English documents, concise in language. It should concisely define what a test does (inputs, outputs, etc.). Two people reading the test specification should produce the same results.
Does it need to define the process by which the test is run? It should be very concrete. The only way to make this completely unambiguous is to write code. The danger is that we may end up writing our own programming language.
Idea: Instead of having an English-language document, define a standard way of writing a test in an agnostic programming language that defines high-level APIs
Literate programming: invert comments and code (e.g., like a LaTeX document with code in it). This implies specifying such a language, and anyone who wants to implement a test needs to know that language.
How concrete should the specification be?
- Leaving certain things (e.g., header formats) open to interpretation could end up affecting the results.
- Could have something that can be run itself, but that also provides enough information to allow someone to go implement it themselves (?)
- Lots of confusion/dissonance about what to leave unspecified vs. default vs. explicitly spelled out.
- How do you know that the specification is consistent with the code, unless it is the code itself?
- The test specification could also serve as a guide in reading the code. (Note: Current specifications aren't really specifications, but rather just documentation.)
Working backwards: consider the current state of affairs. How can we improve it? Porting currently requires reading the OONI spec, interpreting it, and reimplementing.
What Sam did to port an OONI test to a new platform:
1. Understand the OONI abstractions
2. Read the code of the test
3. Reimplement
What's missing? Sam didn't run the test. (Sam couldn't get it to run.) Note: There's no verification that Sam understands what the test is actually doing, or that his reimplementation is the same. Do the two tests behave the same in weird edge cases?
What would help the process Sam went through in the future:
- Instead of having only an explanation of the test, have an explanation + lots of examples (reference implementation).
- A unit testing framework will extract the code examples and verify that they do what they say they do.
- See that your implementation matches the test.
- Black-box unit testing.
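One existing mechanism that matches the "explanation + lots of examples" idea is Python's `doctest` module, which extracts examples embedded in documentation and checks that they produce the stated output. A minimal sketch, assuming a hypothetical helper `normalize_header` that stands in for a small piece of test logic (neither the function nor its behavior comes from the OONI code):

```python
import doctest


def normalize_header(name: str) -> str:
    """Return an HTTP header name in canonical capitalization.

    Hypothetical helper, for illustration only. The examples below are
    documentation and are also executed as tests.

    >>> normalize_header("content-type")
    'Content-Type'
    >>> normalize_header("  HOST ")
    'Host'
    """
    return "-".join(part.capitalize() for part in name.strip().split("-"))


def run_spec_examples() -> doctest.TestResults:
    """Extract the examples from the docstring above and run them."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        normalize_header.__doc__,                  # text containing the examples
        {"normalize_header": normalize_header},   # names visible to the examples
        "normalize_header",                        # test name used in reports
        None,                                      # no source file
        0,                                         # line offset
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


if __name__ == "__main__":
    results = run_spec_examples()
    print(f"attempted={results.attempted} failed={results.failed}")
```

A reimplementation on another platform could be checked against the same examples as long as it exposes the same inputs and outputs, which is the black-box property the notes ask for.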
Some open questions:
- How complete does the implementation need to be? Requires a test harness, running in emulated censorship environments.
- What kinds of questions is this test trying to answer?
- How are we facilitating the running of the test?
Can't anticipate edge cases ahead of time. The best we can hope to do is replicate results in the same environment.
Something to think about: We are building a system for collecting data. Data collection and data analysis/decision framework can certainly be two different things. It may be worth considering that certain tests may require separating data collection from subsequent analysis.
Lots of arguments for why you would want the clients to do no interpretation, only collection. (You can't go back and re-collect a measurement after the fact, but you can analyze collected data as many times as you like.)
Should specification of how the data should be analyzed be part of the test?
Could consider completely separating data collection and analysis. Is the data collection sufficient to begin running the analysis? Will the analysis result in the same conclusion (for multiple versions of the same test)?
Client should be as simple as possible. If it can really just do data collection, that would be awesome.
Rough idea of pipeline:
[Emulated] Network -> Data Collection -> Analysis -> 0/1
Unit tests could then basically test that induced changes in the emulated network produce the same 0/1 outputs. Given a common data format, there could be one implementation of the analysis code, given multiple implementations of the collector code.
Approach: Start with "what" tests (the simpler tests), then follow up with "how" tests. Could build more complex "how" tests by composing many simpler pipeline tests, as above.
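The pipeline can be sketched as a toy in Python. Everything here is an assumption made for illustration: the `Measurement` fields, the dictionary standing in for an emulated network, the two collectors, and the deliberately naive analysis rule (compare against a control measurement). The point is only the shape: several collectors, one shared data format, one analysis.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Measurement:
    """Common data format emitted by every collector (fields are assumptions)."""
    url: str
    status: Optional[int]  # None when no response arrived
    body: str


# A toy emulated network: URL -> (status, body), or None if the request is dropped.
CONTROL = {"http://example.org/": (200, "hello")}
DROPPED = {"http://example.org/": None}
BLOCKPAGE = {"http://example.org/": (403, "blocked")}


def collector_a(network: dict, url: str) -> Measurement:
    """First hypothetical collector implementation."""
    response = network.get(url)
    if response is None:
        return Measurement(url=url, status=None, body="")
    status, body = response
    return Measurement(url=url, status=status, body=body)


def collector_b(network: dict, url: str) -> Measurement:
    """Second hypothetical collector: different code, same output format."""
    status, body = network.get(url) or (None, "")
    return Measurement(url, status, body)


def analyze(measurement: Measurement, control: Measurement) -> int:
    """Single shared analysis step: 1 = blocked, 0 = not blocked (naive rule)."""
    if measurement.status is None:
        return 1
    same = (measurement.status, measurement.body) == (control.status, control.body)
    return 0 if same else 1


def run_pipeline(collector, network: dict, url: str = "http://example.org/") -> int:
    """[Emulated] Network -> Data Collection -> Analysis -> 0/1."""
    control = collector(CONTROL, url)
    return analyze(collector(network, url), control)


if __name__ == "__main__":
    for collector in (collector_a, collector_b):
        outputs = [run_pipeline(collector, net) for net in (CONTROL, DROPPED, BLOCKPAGE)]
        print(collector.__name__, outputs)
```

A unit test in this setup asserts that each induced network change yields the same 0/1 output regardless of which collector produced the measurement.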
Idea: write e2e pipeline for collection+analysis, which then is "automatically" split. One caveat: Analysis on-client could trigger additional measurements.
Creating an ideal/shared data format
Questions: What kinds of data are we even collecting in the first place? What do we want to store? How? Where? What do we collect vs. what do we store? Do we have a schema? Does it need to be human-readable?
Important considerations for metadata (a header on the data itself):
- Version: client version number, in addition to the git commit. (Note: could need version/commit numbers for data produced as a result of both collection and analysis.)
- What test specification one was trying to follow when generating the data (this could just be a git commit hash).
- What implementation.
- What platform.
- Location (IP address, or at least the network/ISP + rough geography). --> On second thought, perhaps this belongs in the test data itself.
- Timestamp.
- Something uniquely identifying the "test run" (a run ID; might be the time that the collector/coordinator initiated the test).
- A notion of provenance/chain of custody. When was the data measured vs. collected vs. uploaded? If scrubbed, what was used to scrub, and by whom? [Example: realizing that a network was compromised during a particular time, data collected from an invalid client, incomplete or buggy data, etc.] Useful for both expunging buggy data and for ex post facto analysis.
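To make the metadata considerations concrete, here is a minimal sketch of a header record in Python. All field names, the use of a UUID as run ID, and the JSON serialization are assumptions for illustration; the notes do not settle on a schema. Location is left out, following the note that it may belong in the test data itself.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


def utc_now() -> str:
    """Current time as an ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class CustodyEvent:
    """One step in the chain of custody (hypothetical structure)."""
    action: str                 # e.g. "measured", "collected", "uploaded", "scrubbed"
    actor: str                  # who or what performed the action
    timestamp: str
    tool: Optional[str] = None  # e.g. what was used to scrub


@dataclass
class MetadataHeader:
    spec_commit: str       # test specification that was being followed
    implementation: str    # which implementation produced the data
    client_version: str
    client_commit: str
    platform: str
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(default_factory=utc_now)
    custody: List[CustodyEvent] = field(default_factory=list)

    def record(self, action: str, actor: str, tool: Optional[str] = None) -> None:
        """Append a custody event stamped with the current time."""
        self.custody.append(CustodyEvent(action, actor, utc_now(), tool))

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def example_header() -> MetadataHeader:
    """Build a header with placeholder values only."""
    header = MetadataHeader(
        spec_commit="<spec commit hash>",
        implementation="example-collector",
        client_version="0.0.1",
        client_commit="<client commit hash>",
        platform="linux",
    )
    header.record("measured", "client")
    header.record("scrubbed", "analyst@example.org", tool="hypothetical-scrubber")
    return header


if __name__ == "__main__":
    print(example_header().to_json())
```

Keeping custody as an append-only list of events is one way to answer "measured vs. collected vs. uploaded" and "scrubbed by whom" from the same structure.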
Note: Some of the above may complicate anonymity issues. [Let's put these aside for now.]
Additional information may include things about "helpers" (e.g., versions for MLab servers, etc.)
Example: a test that makes HTTP requests and records the responses
What data is needed? (Minimum set to determine "what". Ignoring "how" for now.)
Collection phase (read-only after collection):
- HTTP headers of request and response
- HTTP body
- Timestamps for request and response
- Vantage point (different measurements may have different ways to express vantage point: IP address, etc.)
Analysis phase:
- 0/1 (referring to "blocked" or "not blocked")
- For more complicated tests, could include causes, perhaps with confidence.
Very important to give people a way to keep collection and analysis separate.
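The collection/analysis split for this HTTP example could be sketched as two separate record types, with the collection record immutable once created. The field names and the block-page substring rule are assumptions made for illustration, not a proposed detection method.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Optional


@dataclass(frozen=True)
class HttpCollection:
    """Raw data from the measurement device; read-only after collection."""
    vantage_point: str                    # however this measurement expresses it
    request_headers: Mapping[str, str]
    request_time: float                   # seconds since the epoch
    response_headers: Mapping[str, str]
    response_body: bytes
    response_time: float


@dataclass
class HttpAnalysis:
    """Output of the analysis phase; can be recomputed at any time."""
    blocked: int                          # 0 = not blocked, 1 = blocked
    cause: Optional[str] = None
    confidence: Optional[float] = None    # left unset in this sketch


def read_only(headers: dict) -> Mapping[str, str]:
    """Wrap a copy of the headers so they cannot be modified later."""
    return MappingProxyType(dict(headers))


def analyze(collection: HttpCollection, blockpage_marker: bytes) -> HttpAnalysis:
    """Naive illustrative rule: blocked if a known marker appears in the body."""
    if blockpage_marker in collection.response_body:
        return HttpAnalysis(blocked=1, cause="block-page marker found in body")
    return HttpAnalysis(blocked=0)


def example_collection(body: bytes) -> HttpCollection:
    """Build a collection record with placeholder values."""
    return HttpCollection(
        vantage_point="192.0.2.1",
        request_headers=read_only({"Host": "example.org"}),
        request_time=0.0,
        response_headers=read_only({"Content-Type": "text/html"}),
        response_body=body,
        response_time=0.25,
    )
```

Because `analyze` only reads from `HttpCollection`, the same stored record can be re-analyzed with a different rule later without touching the raw data.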
All raw data comes from the measurement device itself.
Question: Should the file be YAML, JSON, SQLite?
Analysis output should include a pointer to the original raw data that was used for the analysis, as well as contact information for the person who did the analysis.
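One possible form for the pointer is a content hash of the raw record, so an analysis result can be matched to the exact data it was derived from. This is a sketch under assumptions: the record layout, the `sha256:` prefix, and the function names are hypothetical, and a file path or run ID would serve equally well as a pointer.

```python
import hashlib
import json


def raw_data_pointer(raw_record: dict) -> str:
    """Hash a canonical JSON encoding of the raw record."""
    canonical = json.dumps(raw_record, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_analysis_record(raw_record: dict, blocked: int, analyst_contact: str) -> dict:
    """Analysis output that carries a pointer to its input and a contact."""
    return {
        "raw_data": raw_data_pointer(raw_record),
        "blocked": blocked,
        "analyst_contact": analyst_contact,
    }


def points_to(analysis_record: dict, raw_record: dict) -> bool:
    """Check that an analysis record was derived from this raw record."""
    return analysis_record["raw_data"] == raw_data_pointer(raw_record)


if __name__ == "__main__":
    raw = {"url": "http://example.org/", "status": 403, "body": "blocked"}
    analysis = make_analysis_record(raw, blocked=1, analyst_contact="analyst@example.org")
    print(json.dumps(analysis, indent=2))
```

Sorting keys before hashing means two collectors that emit the same fields in a different order still produce the same pointer.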