On 2017-08-29 15:50, Georg Fritzsche wrote:
On Thu, Aug 24, 2017 at 10:23 AM, Kurt Roeckx via governance <
governance@lists.mozilla.org> wrote:
On 2017-08-23 16:33, Alex Gaynor wrote:
I had the same question, but it looks like RAPPOR has gotten significantly
more advanced since I originally learned about the "just boolean questions"
version. https://arxiv.org/pdf/1503.01214.pdf explains how to build privacy
preserving measurements without knowing the values of the population.
So if I understand things correctly from the paper, you create a Bloom
filter for the URL/hostname you want to send, then randomly change it and
store that. Each time they ask about the URL/hostname, you take the
stored version, randomly change it again, and that's what you send.
What I understand from that is that you don't get to learn the
URL/hostname at all, but can query if a URL/hostname has been submitted.
You don't get to learn what the population is, but the whole population can
be sent.
Is that accurate?
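
The two randomization steps described above can be sketched as follows. This is only a minimal illustration, assuming the basic RAPPOR scheme: a permanent randomized response (PRR) with parameter f that is computed once and stored, and an instantaneous randomized response (IRR) with parameters p and q that is recomputed for every report. The function names, the SHA-256 based hashing, and the parameter values are assumptions made for the example, not what any real client uses.

```python
import hashlib
import random


def bloom_filter(value, k=128, h=2):
    """Encode `value` into a k-bit Bloom filter using h hash functions.

    The hash construction (SHA-256 with an index prefix) is an assumption
    made for this sketch.
    """
    bits = [0] * k
    for i in range(h):
        digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
        bits[int.from_bytes(digest[:8], "big") % k] = 1
    return bits


def permanent_rr(bits, f, rng):
    """PRR: each bit becomes 1 with prob f/2, 0 with prob f/2, else is kept.

    The result is meant to be computed once per value and stored.
    """
    out = []
    for b in bits:
        u = rng.random()
        if u < f / 2:
            out.append(1)
        elif u < f:
            out.append(0)
        else:
            out.append(b)
    return out


def instantaneous_rr(prr_bits, p, q, rng):
    """IRR: report 1 with prob q where the stored bit is 1, else with prob p."""
    return [1 if rng.random() < (q if b else p) else 0 for b in prr_bits]


# Hypothetical usage: the stored PRR is reused, the IRR changes per report.
rng = random.Random(42)
stored = permanent_rr(bloom_filter("example.org"), f=0.5, rng=rng)
report_1 = instantaneous_rr(stored, p=0.5, q=0.75, rng=rng)
report_2 = instantaneous_rr(stored, p=0.5, q=0.75, rng=rng)
```

The collector only ever sees reports like report_1 and report_2, never the Bloom filter or the stored PRR.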
Hi,
through RAPPOR, we can send randomized values for all encountered domain
values.
Then, in analysis, we can test the noisy aggregate data against known
domain values and get an estimate of how frequently they occurred.
This gives immediate insights and we can increase the detail by adding more
sources for known domain values.
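
As a rough illustration of testing the aggregate against known domain values: the per-bit counts can be corrected for the expected noise, and a candidate domain can then be scored from the bits it would have set. The sketch below is a simplification under stated assumptions. It ignores cohorts, and it scores a candidate by the minimum corrected count over its bits, which is a crude stand-in for the regression-based decoding in the RAPPOR papers. The function names and the hashing are hypothetical and must match whatever the client uses.

```python
import hashlib


def bloom_positions(value, k, h):
    """Bit positions a value sets in a k-bit Bloom filter (assumed hashing)."""
    return {
        int.from_bytes(hashlib.sha256(f"{i}:{value}".encode()).digest()[:8], "big") % k
        for i in range(h)
    }


def estimate_bit_counts(counts, n, f, p, q):
    """Estimate how many of n reports truly had each Bloom bit set.

    counts[i] is how often bit i was 1 in the noisy reports. A bit that is
    truly 0 is reported as 1 with probability p* = p + f*q/2 - f*p/2, and a
    true 1 raises that probability by (1 - f) * (q - p).
    """
    p_star = p + 0.5 * f * q - 0.5 * f * p
    scale = (1 - f) * (q - p)
    return [(c - p_star * n) / scale for c in counts]


def estimate_candidate(value, est_counts, k, h):
    """Crude frequency estimate for one known domain: the smallest corrected
    count among its bits (an assumption of this sketch, not the paper's
    decoder)."""
    return min(est_counts[i] for i in bloom_positions(value, k, h))
```

Adding more candidate domains only means calling estimate_candidate with more values; the collected data does not change.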
The paper has several algorithms in it. The first is described in "II.
BACKGROUND", which does not allow you to learn the dictionary, but you
can check whether certain URLs are in it.
Then in "III. ESTIMATING JOINT DISTRIBUTIONS" they describe how you can
correlate different answers with each other.
Then in "IV. RAPPOR WITHOUT A KNOWN DICTIONARY" they describe that you
can send some additional data, and then use the algorithm from III to
learn something about the dictionary.
Do you intend to use the algorithm from II or from IV?
From what I understand, for the algorithm of II there are various
parameters that affect the noise, and how likely it is someone can learn
something about the data you're sending. I think they at least include:
- The size of the Bloom filter
- The number of hashes you use
- The probability of the randomization for the PRR (f in the paper)
- The probability of the randomization for the IRR (q and p from the paper)
Do you have any idea which you plan to use, and what the effect of that is?
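
To make the effect of these parameters a bit more concrete, here is a small sketch, assuming the basic scheme from section II. It computes the effective probability that a reported bit is 1 when the underlying Bloom bit is 0 or 1, and the longitudinal privacy bound for the PRR as I understand it from the original RAPPOR paper, 2h ln((1 - f/2) / (f/2)). The function names and example values are only for illustration. The Bloom filter size does not appear in this bound; it mainly affects how often different values collide.

```python
import math


def effective_probs(f, p, q):
    """Probability that a reported bit is 1, given the true Bloom bit.

    Returns (p_star, q_star): p_star for a true 0, q_star for a true 1,
    combining the PRR (f) with the IRR (p, q).
    """
    p_star = 0.5 * f * (p + q) + (1 - f) * p
    q_star = 0.5 * f * (p + q) + (1 - f) * q
    return p_star, q_star


def epsilon_permanent(h, f):
    """Longitudinal privacy bound for the PRR with h hash functions.

    A smaller value means stronger privacy; it shrinks as f grows and
    grows with the number of hashes h.
    """
    return 2 * h * math.log((1 - 0.5 * f) / (0.5 * f))


# Hypothetical example values, not a statement of what will be deployed.
p_star, q_star = effective_probs(f=0.5, p=0.5, q=0.75)
eps = epsilon_permanent(h=2, f=0.5)
```

The closer p_star and q_star are to each other, the less each report reveals, and the more reports are needed to estimate frequencies.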
Kurt