On Sun, Aug 27, 2017 at 2:47 PM, David Bruant <bruan...@gmail.com> wrote:
>> Asks for sensitive data center most commonly around knowing something in
>> relation to which sites a user visits:
>>
>> - "Which top sites are users visiting?"
>> - "Which sites using Flash does a user encounter?"
>> - "Which sites does a user see heavy jank on?"
>>
>> In summary, most asks are for occurrences of an event X per domain (more
>> specifically eTLD+1 [1], e.g. facebook.com or google.co.uk).
>>
>> The solution.
>>
>> One solution is the use of differential privacy [2] [3], which allows us
>> to collect sensitive data without being able to make conclusions about
>> individual users, thus preserving their privacy.
>>
>> An attacker that has access to the data a single user submits is not able
>> to tell whether a specific site was visited by that user or not.
>
> Just to be 100% sure I understand: what will happen is that Firefox will
> lie (or answer randomly) to the question with probability p. This way,
> even if an attacker reaches the Moz servers, they can trust the answer
> only with probability 1-p.
> There is a trade-off between utility (low p) and stronger privacy (high p).
> Could this trade-off be documented and a hard lower limit be decided?
> Should each study decide on a different p based on data sensitivity?

Yes, once the value is encoded we will lie or answer randomly about the
status of each bit with a certain probability. This probability depends on
the prior state of each bit in the Bloom filter that holds potential
responses (whether it was a 1 or a 0) and on the parameters of the RAPPOR
algorithm. As an end result, an attacker can effectively trust each
reported bit only with probability 1-p.

As your intuition correctly suggests, there is a balance between utility
and privacy. Our goal is to choose parameters such that the privacy of
users is assured, while also getting statistical insights from the
aggregate data. The privacy guarantee is expressed in terms of the ε
parameter.
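To make the "lying" step concrete, here is a minimal sketch of a
RAPPOR-style randomized response applied to a single Bloom filter bit,
together with the ε level of that step as given in the published RAPPOR
description (for h hash functions). The function names and the noise
parameter f are my own illustration, not the actual Firefox implementation:

```python
import math
import random

def permanent_randomized_response(bit, f):
    """RAPPOR's 'lying' step for one Bloom filter bit: with probability
    f/2 report a 1, with probability f/2 report a 0, and otherwise
    report the true bit."""
    r = random.random()
    if r < f / 2:
        return 1
    elif r < f:
        return 0
    return bit

def epsilon_permanent(f, h=1):
    """Differential-privacy level of the permanent randomized response,
    per the RAPPOR paper: eps = 2h * ln((1 - f/2) / (f/2))."""
    return 2 * h * math.log((1 - f / 2) / (f / 2))

# More lying (larger f) yields a smaller epsilon, i.e. stronger privacy.
print(epsilon_permanent(0.5))   # f = 1/2
print(epsilon_permanent(0.25))  # f = 1/4, weaker privacy than f = 1/2
```

Aggregating many such noisy bits across a large population still lets the
server estimate how often each domain occurs, while no single report
reveals whether that user visited a given site.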
For RAPPOR this takes into account the addition of noise on the client
side via the "lying" mechanism described above. Depending on the data
sensitivity, the population size and the collection frequency (one-time or
repeated), the ε level should be fixed and the appropriate set of
parameters needs to be tuned. Under some circumstances this may mean that
useful data cannot be collected, in which case user privacy is still
preserved.

This parameter choice should be transparently documented, and we need to
establish hard limits as well as best practices around choosing the
parameters. We are working on a blog post that will share the best
practices and approaches that we found.

Georg

_______________________________________________
governance mailing list
governance@lists.mozilla.org
https://lists.mozilla.org/listinfo/governance