On Thu, Feb 26, 2015 at 03:13:19AM +0000, Stephen Farrell wrote:
> One issue I don't think is covered well enough is the potential
> (I don't know if this is actual) risk of re-identification via
> sets of DNS queries.
I've been studying the twin problems of anonymization and
re-identification attacks for the past several years, and would
like to add a general comment about this issue.
One study after another, in disparate fields, has shown that
data that everyone agrees is anonymized, isn't really -- and thus is
subject to re-identification attacks. Sometimes those attacks can
be conducted based solely on the putatively-anonymized data set;
sometimes they can be conducted by combining that data set with
other ones. And one of the disturbing things about that latter
circumstance is that "other ones" isn't necessarily restricted to
data sets which are public: the ongoing parade of massive data
leaks (e.g., Target, Home Depot, Anthem) has ensured that potential
attackers have a plethora of resources to choose from, depending
on what's been made available, what it costs, what their resources
are, etc.
A second disturbing thing is that the quantity of data required
to conduct such attacks is often surprisingly small. The first
link below demonstrates a re-identification attack that is successful
in about 90% of cases...but relies on only four (4) data points.
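To see why so few data points suffice, consider how quickly a handful
of attributes makes records unique. Here's a toy sketch (entirely
synthetic data; the column names and values are invented, not drawn
from any of the studies below):

```python
# Toy illustration: a few quasi-identifier columns can make most
# records unique, which is what enables linkage to external datasets.
# All values below are synthetic and purely illustrative.
from collections import Counter

records = [
    # (zip_prefix, birth_year, gender, visit_day)
    ("021", 1975, "F", "Mon"),
    ("021", 1975, "F", "Tue"),
    ("100", 1982, "M", "Mon"),
    ("100", 1982, "M", "Mon"),  # two identical records: not unique
    ("940", 1990, "F", "Wed"),
    ("606", 1968, "M", "Fri"),
]

counts = Counter(records)
unique = sum(1 for r in records if counts[r] == 1)
print(f"{unique}/{len(records)} records are unique on 4 attributes")
# -> 4/6 records are unique on 4 attributes
```

Any record that is unique on those four attributes can be linked to an
outside data set containing the same attributes plus an identity.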
Some illustrative examples:
Yet Another Report Showing "Anonymous" Data Not At All Anonymous
https://www.techdirt.com/articles/20150209/06111829955/yet-another-report-showing-anonymous-data-not-all-anonymous.shtml
Anonymity and the Netflix Dataset
https://www.schneier.com/blog/archives/2007/12/anonymity_and_t_2.html
Poorly anonymized logs reveal NYC cab drivers' detailed whereabouts
http://arstechnica.com/tech-policy/2014/06/poorly-anonymized-logs-reveal-nyc-cab-drivers-detailed-whereabouts/
A Systematic Review of Re-Identification Attacks on Health Data
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0028071
One person whose work I've been following in this regard is
Arvind Narayanan. He's at Princeton, and he has a blog here:
http://33bits.org/
There are many thoughtful entries in that blog, but I think one of the
best is this one:
De-anonymization is not X: The Need for Re-identification Science
http://33bits.org/2009/10/14/de-anonymization-is-not-x-the-need-for-re-identification-science/
My take, based on everything I've read, is that we're only at the
beginning of understanding this problem. We really do, as that
last blog entry points out, need science in this area, and we don't have
it yet. But based on what we *do* know to date, it seems prudent
to adopt the working assumptions that (1) nearly all data sets are not as
anonymized as we'd like to think they are, and thus (2) nearly all data
sets are subject to some form of re-identification attack, and (3) those
re-identification attacks may be alarmingly successful even if they use
very sparse data.
Now as to applicability in this particular area: I have been thinking a
great deal about something related that I'll glibly label as "Auto-update
considered harmful" for lack of a better term. Auto-update mechanisms
generate very predictable patterns of DNS queries and other traffic
(likely: HTTP/HTTPS), and I hypothesize that it's possible to fingerprint
individual systems and users based on those. I further hypothesize that
it's possible to ascertain not only what software users have installed,
but what versions of software they have installed, *even if the traffic
is encrypted*. But these are nascent hypotheses and I've yet to devise
an experimental methodology to test them, so for purposes of dns-privacy,
I'll just say that Stephen's point is a VERY good one and fully merits
the attention of this group.
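To make the auto-update hypothesis concrete, here's a minimal sketch of
the kind of classifier I have in mind. Everything in it is hypothetical:
the product names, the polling periods, and the assumption that an
observer can recover inter-query timing even when query contents are
encrypted.

```python
# Hypothetical sketch: different auto-update clients poll on
# characteristic intervals, so even with query names hidden by an
# encrypted transport, inter-query timing alone may suggest which
# updater (and roughly which version) is running.
# Product names and polling periods below are invented for illustration.

KNOWN_UPDATERS = {
    "ExampleBrowser 1.x": 300,   # polls every ~5 minutes (invented)
    "ExampleBrowser 2.x": 3600,  # polls hourly (invented)
    "ExampleAV":          900,   # polls every ~15 minutes (invented)
}

def guess_updater(timestamps, tolerance=0.1):
    """Match observed query times (seconds) against known polling periods."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not gaps:
        return None
    mean_gap = sum(gaps) / len(gaps)
    for name, period in KNOWN_UPDATERS.items():
        if abs(mean_gap - period) <= tolerance * period:
            return name
    return None

# Observed queries arriving on a steady ~1-hour cadence
observed = [0, 3590, 7210, 10805]
print(guess_updater(observed))  # -> ExampleBrowser 2.x
```

A real attack would need to handle jitter, interleaved traffic from
multiple applications, and missed observations, but the basic matching
step is no more complicated than this.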
---rsk
_______________________________________________
dns-privacy mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/dns-privacy