Thank you for the reply, Markus. 

That is a very good point. The reason I wanted to try this approach is that, 
even within a very large anonymity set in HTTP headers, the IP address network 
or region may be used to split large sets down to individuals. I saw this in a 
paper, but I don't have the reference to hand. So I thought: what about adding 
noise?

The nice thing about adding noise is that, no matter how much signal is picked 
up, any noise that is not filtered out is anonymizing. Yes, the noise patterns 
may themselves become signal, but other noise can override that signal, too. 
Also, the filtering process can drop a lot of real signal along the way. 

It is nearly impossible to hide from active, targeted, sophisticated 
surveillance, but "full-take" passive collection could be significantly 
hindered by small amounts of sophistication that break naive assumptions. 

Your suggestion is very good, and I am trying to build something like that, but 
with little effect on compatibility. Maybe collect the set of valid headers 
with large anonymity sets, and select a subset of headers that match the real 
configuration in only the most important features. That way, only obscure 
compatibility tests will fail. And have an option to send the real user-agent 
string when an issue does come up. After all, if you only use the real one 
rarely, how can it be profiled? I suppose you could trick people into turning 
that on, but that is a fairly targeted action, not a full-take action, and 
full-take is the primary issue.
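
Roughly, I am picturing something like this (the strings and names here are 
just placeholders, not actual surf code):

#include <stdbool.h>

/* Send a common spoofed UA by default; let the user flip a switch
 * back to the real one when a site breaks. */
static const char *realua    = "Mozilla/5.0 (X11; Linux x86_64) Surf/0.6";  /* placeholder */
static const char *spoofedua = "Mozilla/5.0 (common WebKit UA goes here)";  /* placeholder */
static bool sendrealua = false;  /* toggled by the user, e.g. via a keybinding */

static const char *
useragent(void)
{
        return sendrealua ? realua : spoofedua;
}

The real string only ever goes out when the user explicitly asks for it, so it 
is rarely seen.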

I can collect a set of common user-agent strings, find the subset that are 
WebKit, and use those. Since compatibility tests are usually about the 
rendering engine, that would avoid most of the compatibility issues you get 
with a fully random user-agent. Maybe provide the set of common user-agent 
strings, grouped by rendering engine, as a separate open source project. I 
deal with enough traffic to collect this myself. 
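
As a sketch of what I mean, assuming such a collected pool exists (these 
particular strings are only examples of the kind of entries it would hold):

#include <stdlib.h>

/* Pool of common WebKit user-agent strings, collected from traffic. */
static const char *webkituas[] = {
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/600.3.18 "
        "(KHTML, like Gecko) Version/8.0.3 Safari/600.3.18",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36",
};

/* Pick one common WebKit UA at random, so the advertised engine
 * still matches what surf actually is. */
static const char *
pickua(void)
{
        return webkituas[rand() % (sizeof(webkituas) / sizeof(webkituas[0]))];
}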

Websites rarely need to know that you are running Linux, but if you are going 
to download software, you can enable the correct OS to be sent. Besides, the 
correct API would be for websites to request that the browser identify the OS, 
the way they request device location, with the user accepting the request 
explicitly. The need is rare enough that not every website should get the 
operating system for free. Browsers already do this for credit card data (not 
that I would use it); they could have a form fill for the operating system, 
too. When an input is named "operating_system", prompt to fill it. 

The "noise" I add to the accept-language header is easily identified as a new 
signal, so I am leaning toward abandoning it, but there are some interesting 
opportunities there. For example, it can be used when active surveillance is 
not an issue, but passive surveillance is an issue, to add friction to the 
passive surveillance machine. 
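
For reference, the kind of noise I mean is roughly this (not the exact patch, 
just the shape of it): append a random extra language with a random q value, 
which is exactly the sort of tail that sticks out as a new pattern.

#include <stdio.h>
#include <stdlib.h>

/* Append one random language tag with a random weight to an
 * otherwise common Accept-Language value. */
static void
noisyacceptlang(char *buf, size_t len)
{
        static const char *extra[] = { "de", "fr", "es", "pt", "it" };
        snprintf(buf, len, "en-US,en;q=0.9,%s;q=0.%d",
                 extra[rand() % 5], 1 + rand() % 8);
}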

Ben
  Original Message  
From: Markus Teich
Sent: Wednesday, March 25, 2015 6:14 PM
To: dev@suckless.org
Reply To: dev mail list
Subject: Re: [dev] [surf] [patch] 13 patches from my Universal Same-Origin 
Policy branch

Nick wrote:
> - [PATCH 07/13] add random entropy to user-agent and accept-language headers.
> 
> I definitely like the idea, but wonder whether the solution in the patch is a
> bit overkill. After all, if we're basically just trying to defeat hashing
> correlations, then one random byte at the end of each variable should be
> enough. Also, unless I'm misreading it, am I correct in thinking the
> user-agent string is fully random? I'm currently using one from an oldish
> firefox, to reduce fingerprintability a bit, and I get annoying warnings on
> github and a few other places as a result - isn't it better to use a
> common-ish UA string with some random crap on the end, so most stupid websites
> won't do something annoying?

Heyho,

randomizing these headers at all rapidly shrinks the anonymity set size. Sure,
for a dumb adversary every request seems to come from another user, but a smart
adversary won't take long to detect these changes, filter them out and have a
nice list of all surf users (and browsers which use the same pattern, which
would probably be not many). When setting the headers to a very common value
(unfortunately I did not find _the_ most common UA and accept-language header
values), users are guaranteed to be part of a very huge anonymity set. If you
really want to randomize the headers, pick a pool of the most common values and
pick one of them at random. This can however lead to different behaviour when
visiting a website twice.

I strongly advise against the randomization. It's also simpler in code to not
use it.

--Markus

