Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread Stephen J. Turnbull
Victor Stinner writes:

 > Let's try to summarize this "vulnerability".
 > 
 > The creation of a Python dictionary has a complexity of O(n) in most 
 > cases, but O(n^2) in the *worst* case. The attack tries to go into the 
 > worst case. It requires to compute a set of N keys having the same hash 
 > value (hash(key1) == hash(key2) == ... hash(keyN)). It only has to 
 > compute these keys once. It looks like it is now cheap enough in 
 > practice to compute this dataset for Python (and other languages).

I don't know the implementation issues well enough to claim it is a
solution, but this hasn't been mentioned before AFAICS:

While the dictionary probe has to start with a hash for backward
compatibility reasons, is there a reason the overflow strategy for
insertion has to be buckets containing lists?  How about
double-hashing, etc?
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread Christian Heimes
Am 31.12.2011 13:03, schrieb Stephen J. Turnbull:
> I don't know the implementation issues well enough to claim it is a
> solution, but this hasn't been mentioned before AFAICS:
> 
> While the dictionary probe has to start with a hash for backward
> compatibility reasons, is there a reason the overflow strategy for
> insertion has to be buckets containing lists?  How about
> double-hashing, etc?

Python's dict implementation doesn't use bucket but open addressing (aka
closed hashed table). The algorithm for conflict resolution doesn't use
double hashing. Instead it takes the original and (in most cases) cached
hash and perturbs the hash with a series of add, multiply and bit shift ops.
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread martin


Zitat von Victor Stinner :


The current implementation of dict uses a linked-list.

[...]

Tell me if I am wrong.


At least with this statement, you are wrong: the current
implementation does *not* use a linked-list. Instead, it
uses open addressing.

Regards,
Martin



___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread PJ Eby
On Sat, Dec 31, 2011 at 7:03 AM, Stephen J. Turnbull wrote:

> While the dictionary probe has to start with a hash for backward
> compatibility reasons, is there a reason the overflow strategy for
> insertion has to be buckets containing lists?  How about
> double-hashing, etc?
>

This won't help, because the keys still have the same hash value. ANYTHING
you do to them after they're generated will result in them still colliding.

The *only* thing that works is to change the hash function in such a way
that the strings end up with different hashes in the first place.
 Otherwise, you'll still end up with (deliberate) collisions.

(Well, technically, you could use trees or some other O log n data
structure as a fallback once you have too many collisions, for some value
of "too many".  Seems a bit wasteful for the purpose, though.)
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread Jeffrey Yasskin
On Wed, Dec 28, 2011 at 5:37 PM, Jesse Noller  wrote:
>
>
> On Wednesday, December 28, 2011 at 8:28 PM, Michael Foord wrote:
>
>> Hello all,
>>
>> A paper (well, presentation) has been published highlighting security 
>> problems with the hashing algorithm (exploiting collisions) in many 
>> programming languages Python included:
>>
>> http://events.ccc.de/congress/2011/Fahrplan/attachments/2007_28C3_Effective_DoS_on_web_application_platforms.pdf
>>
>> Although it's a security issue I'm posting it here because it is now public 
>> and seems important.
>>
>> The issue they report can cause (for example) handling an http post to 
>> consume horrible amounts of cpu. For Python the figures they quoted:
>>
>> reasonable-sized attack strings only for 32 bits Plone has max. POST size of 
>> 1 MB
>> 7 minutes of CPU usage for a 1 MB request
>> ~20 kbits/s → keep one Core Duo core busy
>>
>> This was apparently reported to the security list, but hasn't been responded 
>> to beyond an acknowledgement on November 24th (the original report didn't 
>> make it onto the security list because it was held in a moderation queue).
>>
>> The same vulnerability was reported against various languages and web 
>> frameworks, and is already fixed in some of them.
>>
>> Their recommended fix is to randomize the hash function.
>>
>> All the best,
>>
>> Michael
>>
> Back up link for the PDF:
> http://dl.dropbox.com/u/1374/2007_28C3_Effective_DoS_on_web_application_platforms.pdf
>
> Ocert disclosure:
> http://www.ocert.org/advisories/ocert-2011-003.html

Discussion of hash functions in general:
http://burtleburtle.net/bob/hash/doobs.html
Two of the best hash functions that currently exist:
http://code.google.com/p/cityhash/ and
http://code.google.com/p/smhasher/wiki/MurmurHash.

I'm not sure exactly what problem the paper is primarily complaining about:
1. Multiply+add and multiply+xor hashes are weak: this would be solved
by changing to either of the better-and-faster hashes I linked to
above. On the other hand:
http://mail.python.org/pipermail/python-3000/2007-September/010327.html
2. It's possible to find collisions in any hash function in a 32-bit
space: only solved by picking a varying seed at startup or compile
time.

If you decide to change to a stronger hash function overall, it might
also be useful to change the advice "to somehow mix together (e.g.
using exclusive or) the hash values for the components" in
http://docs.python.org/py3k/reference/datamodel.html#object.__hash__.
hash(tuple(components)) will likely be better if tuple's hash is
improved.

Hash functions are already unstable across Python versions. Making
them unstable across interpreter processes (multiprocessing doesn't
share dicts, right?) doesn't sound like a big additional problem.
Users who want a distributed hash table will need to pull their own
hash function out of hashlib or re-implement a non-cryptographic hash
instead of using the built-in one, but they probably need to do that
already to allow themselves to upgrade Python.

Jeffrey
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread PJ Eby
On Sat, Dec 31, 2011 at 4:04 PM, Jeffrey Yasskin  wrote:

> Hash functions are already unstable across Python versions. Making
> them unstable across interpreter processes (multiprocessing doesn't
> share dicts, right?) doesn't sound like a big additional problem.
> Users who want a distributed hash table will need to pull their own
> hash function out of hashlib or re-implement a non-cryptographic hash
> instead of using the built-in one, but they probably need to do that
> already to allow themselves to upgrade Python.
>

Here's an idea.  Suppose we add a sys.hash_seed or some such, that's
settable to an int, and defaults to whatever we're using now.  Then
programs that want a fix can just set it to a random number, and on Python
versions that support it, it takes effect.  Everywhere else it's a silent
no-op.

Downside: sys has to have slots for this to work; does sys actually have
slots?  My memory's hazy on that.  I guess actually it'd have to be
sys.set_hash_seed().  But same basic idea.

Anyway, this would make fixing the problem *possible*, while still pushing
off the hard decisions to the app/framework developers.  ;-)

Downside: every hash operation includes one extra memory access, but
strings only compute their hash once anyway.)

Given that changing dict won't help, and changing the default hash is a
non-starter, an option to set the seed is probably the way to go.  (Maybe
with an environment variable and/or command line option so users can work
around old code.)
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread Terry Reedy

On 12/31/2011 4:43 PM, PJ Eby wrote:


Here's an idea.  Suppose we add a sys.hash_seed or some such, that's
settable to an int, and defaults to whatever we're using now.  Then
programs that want a fix can just set it to a random number,


I do not think we can allow that to change once there are hashed 
dictionaries existing.


--
Terry Jan Reedy

___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread Guido van Rossum
ISTM the only reasonable thing is to have a random seed picked very early
in the process, to be used to change the hash() function of
str/bytes/unicode (in a way that they are still compatible with each other).

The seed should be unique per process except it should survive fork() (but
not exec()). I'm not worried about unrelated processes needing to have the
same hash(), but I'm not against offering an env variable or command line
flag to force the seed.

I'm not too concerned about a 3rd party being able to guess the random seed
-- this would require much more effort on their part, since they would have
to generate a new set of colliding keys each time they think they have
guessed the hash (as long as they can't force the seed -- this actually
argues slightly *against* offering a way to force the seed, except that we
have strong backwards compatibility requirements).

We need to fix this as far back as Python 2.6, and it would be nice if a
source patch was available that works on Python 2.5 -- personally I do have
a need for a 2.5 fix and if nobody creates one I will probably end up
backporting the fix from 2.6 to 2.5.

Is there a tracker issue yet? The discussion should probably move there.

PS. I would propose a specific fix but I can't seem to build a working
CPython from the trunk on my laptop (OS X 10.6, Xcode 4.1). I get this
error late in the build:

./python.exe -SE -m sysconfig --generate-posix-vars
Fatal Python error: Py_Initialize: can't initialize sys standard streams
Traceback (most recent call last):
  File "/Users/guido/cpython/Lib/io.py", line 60, in 
make: *** [Lib/_sysconfigdata.py] Abort trap

-- 
--Guido van Rossum (python.org/~guido)
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread Guido van Rossum
On Sat, Dec 31, 2011 at 4:56 PM, Guido van Rossum  wrote:

> PS. I would propose a specific fix but I can't seem to build a working
> CPython from the trunk on my laptop (OS X 10.6, Xcode 4.1). I get this
> error late in the build:
>
> ./python.exe -SE -m sysconfig --generate-posix-vars
> Fatal Python error: Py_Initialize: can't initialize sys standard streams
> Traceback (most recent call last):
>   File "/Users/guido/cpython/Lib/io.py", line 60, in 
> make: *** [Lib/_sysconfigdata.py] Abort trap
>

FWIW I managed to build Python 2.6, and a trivial mutation of the
string/unicode hash function (add 1 initially) made only three tests fail;
test_symtable and test_json both have a dependency on dictionary order,
test_ctypes I can't quite figure out what's going on.

Oh, and an unrelated failure in test_sqlite:

  File "/Users/guido/pythons/p26/Lib/sqlite3/test/types.py", line 355, in
CheckSqlTimestamp
self.failUnlessEqual(ts.year, now.year)
AssertionError: 2012 != 2011

I betcha that's because it's still 2011 here in Texas but already 2012 in
UTC-land. Happy New Year everyone! :-)

-- 
--Guido van Rossum (python.org/~guido)
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread Paul McMillan
> I'm not too concerned about a 3rd party being able to guess the random seed
> -- this would require much more effort on their part, since they would have
> to generate a new set of colliding keys each time they think they have
> guessed the hash

This is incorrect. Once an attacker has guessed the random seed, any
operation which reveals the ordering of hashed objects can be used to
verify the answer. JSON responses would be ideal. In fact, an attacker
can do a brute-force attack of the random seed offline. Once they have
the seed, generating collisions is a fast process.

The goal isn't perfection, but we need to do better than a simple
salt. I propose we modify the string hash function like this:

https://gist.github.com/0a91e52efa74f61858b5

This code is based on PyPy's implementation, but the concept is
universal. Rather than choosing a single short random seed per
process, we generate a much larger random seed (r). As we hash, we
deterministically choose a portion of that seed and incorporate it
into the hash process. This modification is a minimally intrusive
change to the existing hash function, and so should not introduce
unexpected side effects which might come from switching to a different
class of hash functions.

I've worked through this code with Alex Gaynor, Antoine Pitrou, and
Victor Stinner, and have asked several mathematicians and security
experts to review the concept. The reviewers who have gotten back to
me thus far have agreed that if the initial random seed is not flawed,
this should not overly change the properties of the hash function, but
should make it quite difficult for an attacker to deduce the necessary
information to predictably cause hash collisions. This function is not
designed to protect against timing attacks, but should be nontrivial
to reverse even with access to timing data.

Empirical testing shows that this unoptimized python implementation
produces ~10% slowdown in the hashing of ~20 character strings. This
is probably an acceptable trade off, and actually provides better
performance in the case of short strings than a high-entropy
fixed-length seed prefix.

-Paul
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread martin

(Well, technically, you could use trees or some other O log n data
structure as a fallback once you have too many collisions, for some value
of "too many".  Seems a bit wasteful for the purpose, though.)


I don't think that would be wasteful. You wouldn't just use the tree for
the case of too many collisions, but for any collision. You might special-case
the case of a single key, i.e. start using the tree only if there is a
collision.

The issue is not the effort, but the need to support ordering if you want
to use trees. So you might restrict this to dicts that have only str keys
(which in practice should never have any collision, unless it's a deliberate
setup).

I'd use the tagged-pointer trick to determine whether a key is an object
pointer (tag 0) or an AVL tree (tag 1). So in the common case of interned
strings, the comparison for pointer equality (which is the normal case
if the keys are interned) will succeed quickly; if pointer comparison fails,
check the tag bit.

Regards,
Martin




___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread Antoine Pitrou
On Sat, 31 Dec 2011 16:56:00 -0700
Guido van Rossum  wrote:
> ISTM the only reasonable thing is to have a random seed picked very early
> in the process, to be used to change the hash() function of
> str/bytes/unicode (in a way that they are still compatible with each other).

Do str and bytes still have to be compatible with each other in 3.x?

Merry hashes, weakrefs and thread-local memoryviews to everyone!

cheers

Antoine.


___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread Guido van Rossum
On Sat, Dec 31, 2011 at 9:11 PM, Antoine Pitrou  wrote:

> On Sat, 31 Dec 2011 16:56:00 -0700
> Guido van Rossum  wrote:
> > ISTM the only reasonable thing is to have a random seed picked very early
> > in the process, to be used to change the hash() function of
> > str/bytes/unicode (in a way that they are still compatible with each
> other).
>
> Do str and bytes still have to be compatible with each other in 3.x?
>

Hm, you're right, that's no longer a concern. (Though ATM the hashes still
*are* compatible.)


> Merry hashes, weakrefs and thread-local memoryviews to everyone!
>

:-)

-- 
--Guido van Rossum (python.org/~guido)
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread Guido van Rossum
On Sat, Dec 31, 2011 at 8:29 PM, Paul McMillan  wrote:

> > I'm not too concerned about a 3rd party being able to guess the random
> seed
> > -- this would require much more effort on their part, since they would
> have
> > to generate a new set of colliding keys each time they think they have
> > guessed the hash
>
> This is incorrect. Once an attacker has guessed the random seed, any
> operation which reveals the ordering of hashed objects can be used to
> verify the answer. JSON responses would be ideal. In fact, an attacker
> can do a brute-force attack of the random seed offline. Once they have
> the seed, generating collisions is a fast process.
>

Still, it would represent an effort for the attacker of a much greater
magnitude than the current attack. It's all a trade-off -- at some point
it'll just be easier for the attacker to use some other vulnerability. Also
the class of vulnerable servers would be greatly reduced.


> The goal isn't perfection, but we need to do better than a simple
> salt.


Perhaps.


> I propose we modify the string hash function like this:
>
> https://gist.github.com/0a91e52efa74f61858b5
>
> This code is based on PyPy's implementation, but the concept is
> universal. Rather than choosing a single short random seed per
> process, we generate a much larger random seed (r). As we hash, we
> deterministically choose a portion of that seed and incorporate it
> into the hash process. This modification is a minimally intrusive
> change to the existing hash function, and so should not introduce
> unexpected side effects which might come from switching to a different
> class of hash functions.
>

I'm not sure I understand this. What's the worry about "a different class
of hash functions"? (It may be clear that I do not have a deep mathematical
understanding of hash functions.)


> I've worked through this code with Alex Gaynor, Antoine Pitrou, and
> Victor Stinner, and have asked several mathematicians and security
> experts to review the concept. The reviewers who have gotten back to
> me thus far have agreed that if the initial random seed is not flawed,


I forget -- what do we do on systems without urandom()? (E.g. Windows?)


> this should not overly change the properties of the hash function, but
> should make it quite difficult for an attacker to deduce the necessary
> information to predictably cause hash collisions. This function is not
> designed to protect against timing attacks, but should be nontrivial
> to reverse even with access to timing data.
>

Let's worry about timing attacks another time okay?


> Empirical testing shows that this unoptimized python implementation
> produces ~10% slowdown in the hashing of ~20 character strings. This
> is probably an acceptable trade off, and actually provides better
> performance in the case of short strings than a high-entropy
> fixed-length seed prefix.
>

Hm. I'm not sure I like the idea of extra arithmetic for every character
being hashed. But I like the idea of a bigger random seed from which we
deterministically pick some part. How about just initializing x to some
subsequence of the seed determined by e.g. the length of the hashed string
plus a few characters from it?

-- 
--Guido van Rossum (python.org/~guido)
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread Paul McMillan
> Still, it would represent an effort for the attacker of a much greater
> magnitude than the current attack. It's all a trade-off -- at some point
> it'll just be easier for the attacker to use some other vulnerability. Also
> the class of vulnerable servers would be greatly reduced.

I agree that doing anything is better than doing nothing. If we use
the earlier suggestion and prepend everything with a fixed-length
seed, we need quite a bit of entropy (and so a fairly long string) to
make a lasting difference.

> I'm not sure I understand this. What's the worry about "a different class of
> hash functions"? (It may be clear that I do not have a deep mathematical
> understanding of hash functions.)

This was mostly in reference to earlier suggestions of switching to
cityhash, or using btrees, or other more invasive changes. Python 2.X
is pretty stable and making large changes like that to the codebase
can have unpredictable effects. We know that the current hash function
works well (except for this specific problem), so it seems like the
best fix will be as minimal a modification as possible, to avoid
introducing bugs.

> I forget -- what do we do on systems without urandom()? (E.g. Windows?)
Windows has CryptGenRandom which is approximately equivalent.

> Let's worry about timing attacks another time okay?
Agreed. As long as there isn't a gaping hole, I'm fine with that.

> Hm. I'm not sure I like the idea of extra arithmetic for every character
> being hashed.

>From a performance standpoint, this may still be better than adding 8
or 10 characters to every single hash operation, since most hashes are
over short strings. It is important that this function touches every
character - if it only interacts with a subset of them, an attacker
can fix that subset and vary the rest.

> But I like the idea of a bigger random seed from which we
> deterministically pick some part.

Yeah. This makes it much harder to attack, since it very solidly
places the attacker outside the realm of "just brute force the key".

> How about just initializing x to some
> subsequence of the seed determined by e.g. the length of the hashed string
> plus a few characters from it?

We did consider this, and if performance is absolutely the prime
directive, this (or a variant) may be the best option. Unfortunately,
the collision generator doesn't necessarily vary the length of the
string. Additionally, if we don't vary based on all the letters in the
string, an attacker can fix the characters that we do use and generate
colliding strings around them.

Another option to consider would be to apply this change to some but
not all of the rounds. If we added the seed lookup xor operation for
only the first and last 5 values of x, we would still retain much of
the benefit without adding much computational overhead for very long
strings.

We could also consider a less computationally expensive operation than
the modulo for calculating the lookup index, like simply truncating to
the correct number of bits.

-Paul
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com