On Tue, Aug 29, 2017 at 5:13 PM, Kurt Roeckx via governance <governance@lists.mozilla.org> wrote:
> On 2017-08-29 15:50, Georg Fritzsche wrote:
>
>> On Thu, Aug 24, 2017 at 10:23 AM, Kurt Roeckx via governance
>> <governance@lists.mozilla.org> wrote:
>>
>>> On 2017-08-23 16:33, Alex Gaynor wrote:
>>>
>>>> I had the same question, but it looks like RAPPOR has gotten
>>>> significantly more advanced since I originally learned about the
>>>> "just boolean questions" version.
>>>> https://arxiv.org/pdf/1503.01214.pdf explains how to build
>>>> privacy preserving measurements without knowing the values of the
>>>> population.
>>>
>>> So if I understand things correctly from the paper, you create a bloom
>>> filter for the URL/hostname you want to send, then randomly change it,
>>> store that. And each time they ask about the URL/hostname you take the
>>> stored version, randomly change it and that's what you send.
>>>
>>> What I understand from that is that you don't get to learn the
>>> URL/hostname at all, but can query if a URL/hostname has been
>>> submitted. You don't get to learn what the population is, but the
>>> whole population can be sent.
>>>
>>> Is that accurate?
>>
>> Hi,
>>
>> through RAPPOR, we can send randomized values for all encountered
>> domain values.
>>
>> Then, in analysis, we can test the noisy aggregate data against known
>> domain values and get an estimate of how frequently they occurred.
>>
>> This gives immediate insights and we can increase the detail by adding
>> more sources for known domain values.
>
> The paper has several algorithms in it. The first is described in "II.
> BACKGROUND", which does not allow you to learn the dictionary, but you
> can check whether certain URLs are in it.
>
> Then in "III. ESTIMATING JOINT DISTRIBUTIONS" they describe how you can
> correlate different answers with each other.
>
> Then in "IV. RAPPOR WITHOUT A KNOWN DICTIONARY" they describe that you
> can send some additional data, and then use the algorithm from III to
> learn something about the dictionary.
>
> Do you intend to use the algorithm from II or from IV?
>
> From what I understand, for the algorithm of II there are various
> parameters that affect the noise, and how likely it is someone can learn
> something about the data you're sending. I think they at least include:
> - The size of the bloom filter
> - The number of hashes you use
> - The probability of the randomization for the PRR (f in the paper)
> - The probability of the randomization for the IRR (q and p from the
>   paper)
>
> Do you have any idea which you plan to use, and what the effect of that
> is?

The referenced paper is a newer one ("Building RAPPOR with the unknown
[...]"). Our current work is based on the first paper
<https://arxiv.org/pdf/1407.6981.pdf>.
The anonymization part is described in sections 3.1 and 3.2 of that
paper. The aggregation/decoding is described in section 4.

We will publish a summary of the technical details if we decide to move
forward with this.
We are also working on a blog post that will share the best practices and
approaches that we found while working on this.

Georg
_______________________________________________
governance mailing list
governance@lists.mozilla.org
https://lists.mozilla.org/listinfo/governance
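
Editorial illustration: the client-side steps discussed in this thread (hash the value into a Bloom filter, apply a memoized permanent randomized response, then a fresh instantaneous randomized response per report) can be sketched roughly as below. This is a minimal sketch under assumptions, not the implementation discussed on the list: the SHA-256 based hashing, the names `bloom_bits`, `permanent_rr`, `instantaneous_rr` and `RapporClient`, and the default parameter values are all placeholders chosen for illustration, and cohort assignment is left out entirely.

```python
import hashlib
import random
from typing import Dict, List, Optional


def bloom_bits(value: str, k: int, h: int) -> List[int]:
    """Set up to h positions of a k-bit Bloom filter for `value`.

    Assumption: SHA-256 with an index prefix stands in for the h hash
    functions; a real deployment may hash differently.
    """
    bits = [0] * k
    for i in range(h):
        digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
        bits[int.from_bytes(digest[:8], "big") % k] = 1
    return bits


def permanent_rr(bits: List[int], f: float, rng: random.Random) -> List[int]:
    """PRR: each bit becomes 1 with prob. f/2, 0 with prob. f/2, else is kept."""
    out = []
    for b in bits:
        u = rng.random()
        if u < f / 2:
            out.append(1)
        elif u < f:
            out.append(0)
        else:
            out.append(b)
    return out


def instantaneous_rr(prr: List[int], p: float, q: float,
                     rng: random.Random) -> List[int]:
    """IRR: report 1 with prob. q where the PRR bit is 1, prob. p where it is 0."""
    return [1 if rng.random() < (q if b else p) else 0 for b in prr]


class RapporClient:
    """Hypothetical client that memoizes the PRR per value."""

    def __init__(self, k: int = 128, h: int = 2, f: float = 0.5,
                 p: float = 0.5, q: float = 0.75,
                 seed: Optional[int] = None) -> None:
        # Illustrative defaults only; the thread does not state which
        # parameter values would actually be used.
        self.k, self.h, self.f, self.p, self.q = k, h, f, p, q
        self.rng = random.Random(seed)
        self._memo: Dict[str, List[int]] = {}

    def report(self, value: str) -> List[int]:
        # The PRR is computed once per value and reused for every report.
        if value not in self._memo:
            bits = bloom_bits(value, self.k, self.h)
            self._memo[value] = permanent_rr(bits, self.f, self.rng)
        # The IRR is drawn fresh each time a report is sent.
        return instantaneous_rr(self._memo[value], self.p, self.q, self.rng)
```

Calling `RapporClient().report("example.org")` twice reuses the stored PRR bits but draws new IRR noise each time, which matches the "take the stored version, randomly change it and that's what you send" description quoted above. The four parameters listed in the quoted question (filter size, number of hashes, f, and p/q) map onto `k`, `h`, `f`, `p` and `q` here.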