Data Openness

There is a spectrum between perfect security/privacy and full openness. Aim: answer what is necessary for each.

Questions: How can we ensure secrecy in the future? How can we define safe (open) vs. unsafe (un-open, private) data? Given the potential for de-anonymization, can we truly anonymize data? How can we balance the security of data with its utility for research?

The need for security cannot be obviated by informed consent.

Safe vs. unsafe: safe data does not include people; unsafe data could be used by a censor to improve their products.

Possible arms race / counterexamples: research could give circumvention tools an edge, but the same research could also improve censorship.

If we narrow the scope to censorship measurement, does that help us increase utility and decrease risk?

Distinction: data about people at risk vs data collected by people at risk

New analyses can be performed after the initial data release without having to go out and collect data again. There is also a question about the reproducibility of research.

Continued security auditing is needed to find problems throughout the life of the system, since researchers keep finding new de-anonymization techniques. With open data, we also do not have to trust only ourselves to keep the data secure.

Consultants could monetize the data.

PII may give researchers enough information to produce better results.

Open vs. closed data: companies profit from closed data, so closed data has benefits of its own.

What if we have so many people contributing data that the censor can't attack them all? They could still just make an example of someone.

Does the collection method need to be "open", or just the data/results? What does open mean? Does open mean standard types, formats, etc.?

Problem 1: this doesn't actually prevent identification of participants. Problem 2: the public is still vulnerable to the small group of people with access; a smaller group, yes, but still vulnerable.

Question: What if we limit access to queries, so that researchers can run scripts and see results rather than get the raw data?
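
A minimal sketch of what such query-only access might look like, assuming a simple in-memory store; the class name, threshold, and fields below are purely illustrative and do not describe any existing system:

 # Hypothetical query-only interface: researchers get aggregate answers,
 # never the raw per-measurement rows. Names and threshold are illustrative.
 from typing import Callable, Dict, List
 
 MIN_GROUP_SIZE = 20  # refuse answers that would isolate a small group
 
 class QueryOnlyStore:
     def __init__(self, rows: List[Dict]):
         self._rows = rows  # raw data never leaves this object
 
     def count_where(self, predicate: Callable[[Dict], bool]) -> int:
         """Return an aggregate count, suppressed if the group is too small."""
         n = sum(1 for row in self._rows if predicate(row))
         if 0 < n < MIN_GROUP_SIZE:
             raise ValueError("query refused: result would isolate a small group")
         return n
 
 # Usage: how many measurements saw the test URL blocked?
 store = QueryOnlyStore([
     {"country": "XX", "url": "http://example.com", "blocked": True},
     {"country": "YY", "url": "http://example.com", "blocked": False},
 ])
 try:
     print(store.count_where(lambda r: r["blocked"]))
 except ValueError as err:
     print(err)

Even this does not fully answer Problem 1 above: repeated or very narrow queries can still leak information about individual participants.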

Possible solution: data that is not about people, i.e., not collected about a person and not collected by a person who could be identified from the data.

Question: Does the consent of the user imply the ability to publish any data? No: users depend on researchers to protect their interests, and they cannot consent on behalf of others who may be impacted by the measurements they consented to.

Question: If we make identification hard enough, is it secure? There is no obvious way to be truly safe, but there are ways to make identification harder: an encrypted database of credit card numbers is better than storing them in plaintext.
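
As a concrete illustration of "harder, not truly safe", here is a sketch of encrypting records at rest using the third-party cryptography package (Fernet). The record contents are made up, and the hard part, key management, is deliberately not shown:

 # Encrypted at rest is better than plaintext, but only as strong as the key
 # management around it (not shown here). Record contents are made up.
 from cryptography.fernet import Fernet
 
 key = Fernet.generate_key()      # in practice, kept in a separate key store
 cipher = Fernet(key)
 
 record = b"participant-42,203.0.113.7,http://example.com,blocked"
 token = cipher.encrypt(record)   # this ciphertext is what gets written to disk
 
 print(token)                     # unreadable without the key
 print(cipher.decrypt(token))     # recoverable only by whoever holds the key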

Not good enough: governments could retroactively attack children for the actions of their parents.

Question: What if we could make the stronger claim that the data is secure for x number of years? The type of measurement indicates the risk.

Assumption: being identified as a participant puts you at risk; the amount of risk varies.

Idea: brand the tool as a network quality/diagnostic tool, which gives participants plausible deniability.

What does it mean to be safe?

The branding idea raises further questions: What is safe enough? What tradeoffs should we make? Should a review board decide what is safe enough? Can we make the collection methodology open but not give access to the data? Axes: open-or-closed spectrums for methodology, results, data format, and standards.


Data openness (group 2)

What data should be collected such that it:

1. supports useful tests
2. can be published publicly

Example: Tor bridge IPs. But what if the HTML contents of the pages tested for censorship contain sensitive data?

M-Lab's approach: "Don't give me anything you don't want me to give to everybody else" (Tor tries to implement the same thing).

Issues: Anonymizing data is harder than it sounds. Some tests cannot be performed because the necessary data is not collected. Who gets to use the collected data? Who generates the data, and who is the data about (who does the data link to)?
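
One way "anonymizing is harder than it sounds" shows up in practice: replacing IP addresses with hashes looks like anonymization, but the address space is small enough to brute-force. A small illustration using only the Python standard library (the addresses are documentation examples, not real data):

 # Pseudonymizing IPs with an unsalted hash is reversible: an adversary can
 # hash every candidate address and match it against the published value.
 import hashlib
 import ipaddress
 
 def pseudonymize(ip: str) -> str:
     return hashlib.sha256(ip.encode()).hexdigest()
 
 published = pseudonymize("198.51.100.23")   # what a naive release would contain
 
 # An adversary who suspects the participant sits in a known /24 tries them all:
 for candidate in ipaddress.ip_network("198.51.100.0/24").hosts():
     if pseudonymize(str(candidate)) == published:
         print("re-identified:", candidate)
         break

A secret salt raises the cost, but whoever holds the salt can still re-identify participants, which is exactly the access-control question raised below.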

Slippery definitions of "harm": the government's definition of what data is harmful to collect might differ from reality. There is harm the data can bring to the collectors, and harm to whoever keeps the questionable data.

Some users don't care about their privacy, and they can be used for all tests.

The time factor: what if this changes later and privacy suddenly becomes important?

Self-selected users (can we let the users assess the risks?) vs. users selected by the researcher (can the researcher/users assess the risks?)

For the sake of measurement, do we need to provide software that will protect users from something harmful?

Goal: data that is safe and useful, i.e., sufficiently useful for research and safe enough for the collectors.

Issues: scaling of anonymizing/scrubbing; scrubbing by the reporter vs. scrubbing by the collector; collecting vs. releasing (and the access controls in between). Research data needs to be shared. Who do we want to share the data with? Other researchers in the field, and…?

What should researchers use for their papers? If they use pre-scrubbed data, they bear no responsibility for it. So should scientists only use public data for papers?

Collected data is hard to keep secret. It could be interpreted by the media as "surveillance" (imagined headline: "NSF: the NSA cousin?").

The ONI database is secret by default. But censorship measurement is different: it has real adversaries, and those adversaries are very motivated to gain access to the collected data (and the data might not be particularly safe at a university).