Re: [Python-Dev] Hash collision security issue (now public)
Victor Stinner writes: > Let's try to summarize this "vulnerability". > > The creation of a Python dictionary has a complexity of O(n) in most > cases, but O(n^2) in the *worst* case. The attack tries to go into the > worst case. It requires to compute a set of N keys having the same hash > value (hash(key1) == hash(key2) == ... hash(keyN)). It only has to > compute these keys once. It looks like it is now cheap enough in > practice to compute this dataset for Python (and other languages). I don't know the implementation issues well enough to claim it is a solution, but this hasn't been mentioned before AFAICS: While the dictionary probe has to start with a hash for backward compatibility reasons, is there a reason the overflow strategy for insertion has to be buckets containing lists? How about double-hashing, etc? ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Am 31.12.2011 13:03, schrieb Stephen J. Turnbull: > I don't know the implementation issues well enough to claim it is a > solution, but this hasn't been mentioned before AFAICS: > > While the dictionary probe has to start with a hash for backward > compatibility reasons, is there a reason the overflow strategy for > insertion has to be buckets containing lists? How about > double-hashing, etc? Python's dict implementation doesn't use bucket but open addressing (aka closed hashed table). The algorithm for conflict resolution doesn't use double hashing. Instead it takes the original and (in most cases) cached hash and perturbs the hash with a series of add, multiply and bit shift ops. ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Zitat von Victor Stinner : The current implementation of dict uses a linked-list. [...] Tell me if I am wrong. At least with this statement, you are wrong: the current implementation does *not* use a linked-list. Instead, it uses open addressing. Regards, Martin ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
On Sat, Dec 31, 2011 at 7:03 AM, Stephen J. Turnbull wrote: > While the dictionary probe has to start with a hash for backward > compatibility reasons, is there a reason the overflow strategy for > insertion has to be buckets containing lists? How about > double-hashing, etc? > This won't help, because the keys still have the same hash value. ANYTHING you do to them after they're generated will result in them still colliding. The *only* thing that works is to change the hash function in such a way that the strings end up with different hashes in the first place. Otherwise, you'll still end up with (deliberate) collisions. (Well, technically, you could use trees or some other O log n data structure as a fallback once you have too many collisions, for some value of "too many". Seems a bit wasteful for the purpose, though.) ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
On Wed, Dec 28, 2011 at 5:37 PM, Jesse Noller wrote: > > > On Wednesday, December 28, 2011 at 8:28 PM, Michael Foord wrote: > >> Hello all, >> >> A paper (well, presentation) has been published highlighting security >> problems with the hashing algorithm (exploiting collisions) in many >> programming languages Python included: >> >> http://events.ccc.de/congress/2011/Fahrplan/attachments/2007_28C3_Effective_DoS_on_web_application_platforms.pdf >> >> Although it's a security issue I'm posting it here because it is now public >> and seems important. >> >> The issue they report can cause (for example) handling an http post to >> consume horrible amounts of cpu. For Python the figures they quoted: >> >> reasonable-sized attack strings only for 32 bits Plone has max. POST size of >> 1 MB >> 7 minutes of CPU usage for a 1 MB request >> ~20 kbits/s → keep one Core Duo core busy >> >> This was apparently reported to the security list, but hasn't been responded >> to beyond an acknowledgement on November 24th (the original report didn't >> make it onto the security list because it was held in a moderation queue). >> >> The same vulnerability was reported against various languages and web >> frameworks, and is already fixed in some of them. >> >> Their recommended fix is to randomize the hash function. >> >> All the best, >> >> Michael >> > Back up link for the PDF: > http://dl.dropbox.com/u/1374/2007_28C3_Effective_DoS_on_web_application_platforms.pdf > > Ocert disclosure: > http://www.ocert.org/advisories/ocert-2011-003.html Discussion of hash functions in general: http://burtleburtle.net/bob/hash/doobs.html Two of the best hash functions that currently exist: http://code.google.com/p/cityhash/ and http://code.google.com/p/smhasher/wiki/MurmurHash. I'm not sure exactly what problem the paper is primarily complaining about: 1. Multiply+add and multiply+xor hashes are weak: this would be solved by changing to either of the better-and-faster hashes I linked to above. On the other hand: http://mail.python.org/pipermail/python-3000/2007-September/010327.html 2. It's possible to find collisions in any hash function in a 32-bit space: only solved by picking a varying seed at startup or compile time. If you decide to change to a stronger hash function overall, it might also be useful to change the advice "to somehow mix together (e.g. using exclusive or) the hash values for the components" in http://docs.python.org/py3k/reference/datamodel.html#object.__hash__. hash(tuple(components)) will likely be better if tuple's hash is improved. Hash functions are already unstable across Python versions. Making them unstable across interpreter processes (multiprocessing doesn't share dicts, right?) doesn't sound like a big additional problem. Users who want a distributed hash table will need to pull their own hash function out of hashlib or re-implement a non-cryptographic hash instead of using the built-in one, but they probably need to do that already to allow themselves to upgrade Python. Jeffrey ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
On Sat, Dec 31, 2011 at 4:04 PM, Jeffrey Yasskin wrote: > Hash functions are already unstable across Python versions. Making > them unstable across interpreter processes (multiprocessing doesn't > share dicts, right?) doesn't sound like a big additional problem. > Users who want a distributed hash table will need to pull their own > hash function out of hashlib or re-implement a non-cryptographic hash > instead of using the built-in one, but they probably need to do that > already to allow themselves to upgrade Python. > Here's an idea. Suppose we add a sys.hash_seed or some such, that's settable to an int, and defaults to whatever we're using now. Then programs that want a fix can just set it to a random number, and on Python versions that support it, it takes effect. Everywhere else it's a silent no-op. Downside: sys has to have slots for this to work; does sys actually have slots? My memory's hazy on that. I guess actually it'd have to be sys.set_hash_seed(). But same basic idea. Anyway, this would make fixing the problem *possible*, while still pushing off the hard decisions to the app/framework developers. ;-) Downside: every hash operation includes one extra memory access, but strings only compute their hash once anyway.) Given that changing dict won't help, and changing the default hash is a non-starter, an option to set the seed is probably the way to go. (Maybe with an environment variable and/or command line option so users can work around old code.) ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
On 12/31/2011 4:43 PM, PJ Eby wrote: Here's an idea. Suppose we add a sys.hash_seed or some such, that's settable to an int, and defaults to whatever we're using now. Then programs that want a fix can just set it to a random number, I do not think we can allow that to change once there are hashed dictionaries existing. -- Terry Jan Reedy ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
ISTM the only reasonable thing is to have a random seed picked very early in the process, to be used to change the hash() function of str/bytes/unicode (in a way that they are still compatible with each other). The seed should be unique per process except it should survive fork() (but not exec()). I'm not worried about unrelated processes needing to have the same hash(), but I'm not against offering an env variable or command line flag to force the seed. I'm not too concerned about a 3rd party being able to guess the random seed -- this would require much more effort on their part, since they would have to generate a new set of colliding keys each time they think they have guessed the hash (as long as they can't force the seed -- this actually argues slightly *against* offering a way to force the seed, except that we have strong backwards compatibility requirements). We need to fix this as far back as Python 2.6, and it would be nice if a source patch was available that works on Python 2.5 -- personally I do have a need for a 2.5 fix and if nobody creates one I will probably end up backporting the fix from 2.6 to 2.5. Is there a tracker issue yet? The discussion should probably move there. PS. I would propose a specific fix but I can't seem to build a working CPython from the trunk on my laptop (OS X 10.6, Xcode 4.1). I get this error late in the build: ./python.exe -SE -m sysconfig --generate-posix-vars Fatal Python error: Py_Initialize: can't initialize sys standard streams Traceback (most recent call last): File "/Users/guido/cpython/Lib/io.py", line 60, in make: *** [Lib/_sysconfigdata.py] Abort trap -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
On Sat, Dec 31, 2011 at 4:56 PM, Guido van Rossum wrote: > PS. I would propose a specific fix but I can't seem to build a working > CPython from the trunk on my laptop (OS X 10.6, Xcode 4.1). I get this > error late in the build: > > ./python.exe -SE -m sysconfig --generate-posix-vars > Fatal Python error: Py_Initialize: can't initialize sys standard streams > Traceback (most recent call last): > File "/Users/guido/cpython/Lib/io.py", line 60, in > make: *** [Lib/_sysconfigdata.py] Abort trap > FWIW I managed to build Python 2.6, and a trivial mutation of the string/unicode hash function (add 1 initially) made only three tests fail; test_symtable and test_json both have a dependency on dictionary order, test_ctypes I can't quite figure out what's going on. Oh, and an unrelated failure in test_sqlite: File "/Users/guido/pythons/p26/Lib/sqlite3/test/types.py", line 355, in CheckSqlTimestamp self.failUnlessEqual(ts.year, now.year) AssertionError: 2012 != 2011 I betcha that's because it's still 2011 here in Texas but already 2012 in UTC-land. Happy New Year everyone! :-) -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
> I'm not too concerned about a 3rd party being able to guess the random seed > -- this would require much more effort on their part, since they would have > to generate a new set of colliding keys each time they think they have > guessed the hash This is incorrect. Once an attacker has guessed the random seed, any operation which reveals the ordering of hashed objects can be used to verify the answer. JSON responses would be ideal. In fact, an attacker can do a brute-force attack of the random seed offline. Once they have the seed, generating collisions is a fast process. The goal isn't perfection, but we need to do better than a simple salt. I propose we modify the string hash function like this: https://gist.github.com/0a91e52efa74f61858b5 This code is based on PyPy's implementation, but the concept is universal. Rather than choosing a single short random seed per process, we generate a much larger random seed (r). As we hash, we deterministically choose a portion of that seed and incorporate it into the hash process. This modification is a minimally intrusive change to the existing hash function, and so should not introduce unexpected side effects which might come from switching to a different class of hash functions. I've worked through this code with Alex Gaynor, Antoine Pitrou, and Victor Stinner, and have asked several mathematicians and security experts to review the concept. The reviewers who have gotten back to me thus far have agreed that if the initial random seed is not flawed, this should not overly change the properties of the hash function, but should make it quite difficult for an attacker to deduce the necessary information to predictably cause hash collisions. This function is not designed to protect against timing attacks, but should be nontrivial to reverse even with access to timing data. Empirical testing shows that this unoptimized python implementation produces ~10% slowdown in the hashing of ~20 character strings. This is probably an acceptable trade off, and actually provides better performance in the case of short strings than a high-entropy fixed-length seed prefix. -Paul ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
(Well, technically, you could use trees or some other O log n data structure as a fallback once you have too many collisions, for some value of "too many". Seems a bit wasteful for the purpose, though.) I don't think that would be wasteful. You wouldn't just use the tree for the case of too many collisions, but for any collision. You might special-case the case of a single key, i.e. start using the tree only if there is a collision. The issue is not the effort, but the need to support ordering if you want to use trees. So you might restrict this to dicts that have only str keys (which in practice should never have any collision, unless it's a deliberate setup). I'd use the tagged-pointer trick to determine whether a key is an object pointer (tag 0) or an AVL tree (tag 1). So in the common case of interned strings, the comparison for pointer equality (which is the normal case if the keys are interned) will succeed quickly; if pointer comparison fails, check the tag bit. Regards, Martin ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
On Sat, 31 Dec 2011 16:56:00 -0700 Guido van Rossum wrote: > ISTM the only reasonable thing is to have a random seed picked very early > in the process, to be used to change the hash() function of > str/bytes/unicode (in a way that they are still compatible with each other). Do str and bytes still have to be compatible with each other in 3.x? Merry hashes, weakrefs and thread-local memoryviews to everyone! cheers Antoine. ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
On Sat, Dec 31, 2011 at 9:11 PM, Antoine Pitrou wrote: > On Sat, 31 Dec 2011 16:56:00 -0700 > Guido van Rossum wrote: > > ISTM the only reasonable thing is to have a random seed picked very early > > in the process, to be used to change the hash() function of > > str/bytes/unicode (in a way that they are still compatible with each > other). > > Do str and bytes still have to be compatible with each other in 3.x? > Hm, you're right, that's no longer a concern. (Though ATM the hashes still *are* compatible.) > Merry hashes, weakrefs and thread-local memoryviews to everyone! > :-) -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
On Sat, Dec 31, 2011 at 8:29 PM, Paul McMillan wrote: > > I'm not too concerned about a 3rd party being able to guess the random > seed > > -- this would require much more effort on their part, since they would > have > > to generate a new set of colliding keys each time they think they have > > guessed the hash > > This is incorrect. Once an attacker has guessed the random seed, any > operation which reveals the ordering of hashed objects can be used to > verify the answer. JSON responses would be ideal. In fact, an attacker > can do a brute-force attack of the random seed offline. Once they have > the seed, generating collisions is a fast process. > Still, it would represent an effort for the attacker of a much greater magnitude than the current attack. It's all a trade-off -- at some point it'll just be easier for the attacker to use some other vulnerability. Also the class of vulnerable servers would be greatly reduced. > The goal isn't perfection, but we need to do better than a simple > salt. Perhaps. > I propose we modify the string hash function like this: > > https://gist.github.com/0a91e52efa74f61858b5 > > This code is based on PyPy's implementation, but the concept is > universal. Rather than choosing a single short random seed per > process, we generate a much larger random seed (r). As we hash, we > deterministically choose a portion of that seed and incorporate it > into the hash process. This modification is a minimally intrusive > change to the existing hash function, and so should not introduce > unexpected side effects which might come from switching to a different > class of hash functions. > I'm not sure I understand this. What's the worry about "a different class of hash functions"? (It may be clear that I do not have a deep mathematical understanding of hash functions.) > I've worked through this code with Alex Gaynor, Antoine Pitrou, and > Victor Stinner, and have asked several mathematicians and security > experts to review the concept. The reviewers who have gotten back to > me thus far have agreed that if the initial random seed is not flawed, I forget -- what do we do on systems without urandom()? (E.g. Windows?) > this should not overly change the properties of the hash function, but > should make it quite difficult for an attacker to deduce the necessary > information to predictably cause hash collisions. This function is not > designed to protect against timing attacks, but should be nontrivial > to reverse even with access to timing data. > Let's worry about timing attacks another time okay? > Empirical testing shows that this unoptimized python implementation > produces ~10% slowdown in the hashing of ~20 character strings. This > is probably an acceptable trade off, and actually provides better > performance in the case of short strings than a high-entropy > fixed-length seed prefix. > Hm. I'm not sure I like the idea of extra arithmetic for every character being hashed. But I like the idea of a bigger random seed from which we deterministically pick some part. How about just initializing x to some subsequence of the seed determined by e.g. the length of the hashed string plus a few characters from it? -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
> Still, it would represent an effort for the attacker of a much greater > magnitude than the current attack. It's all a trade-off -- at some point > it'll just be easier for the attacker to use some other vulnerability. Also > the class of vulnerable servers would be greatly reduced. I agree that doing anything is better than doing nothing. If we use the earlier suggestion and prepend everything with a fixed-length seed, we need quite a bit of entropy (and so a fairly long string) to make a lasting difference. > I'm not sure I understand this. What's the worry about "a different class of > hash functions"? (It may be clear that I do not have a deep mathematical > understanding of hash functions.) This was mostly in reference to earlier suggestions of switching to cityhash, or using btrees, or other more invasive changes. Python 2.X is pretty stable and making large changes like that to the codebase can have unpredictable effects. We know that the current hash function works well (except for this specific problem), so it seems like the best fix will be as minimal a modification as possible, to avoid introducing bugs. > I forget -- what do we do on systems without urandom()? (E.g. Windows?) Windows has CryptGenRandom which is approximately equivalent. > Let's worry about timing attacks another time okay? Agreed. As long as there isn't a gaping hole, I'm fine with that. > Hm. I'm not sure I like the idea of extra arithmetic for every character > being hashed. >From a performance standpoint, this may still be better than adding 8 or 10 characters to every single hash operation, since most hashes are over short strings. It is important that this function touches every character - if it only interacts with a subset of them, an attacker can fix that subset and vary the rest. > But I like the idea of a bigger random seed from which we > deterministically pick some part. Yeah. This makes it much harder to attack, since it very solidly places the attacker outside the realm of "just brute force the key". > How about just initializing x to some > subsequence of the seed determined by e.g. the length of the hashed string > plus a few characters from it? We did consider this, and if performance is absolutely the prime directive, this (or a variant) may be the best option. Unfortunately, the collision generator doesn't necessarily vary the length of the string. Additionally, if we don't vary based on all the letters in the string, an attacker can fix the characters that we do use and generate colliding strings around them. Another option to consider would be to apply this change to some but not all of the rounds. If we added the seed lookup xor operation for only the first and last 5 values of x, we would still retain much of the benefit without adding much computational overhead for very long strings. We could also consider a less computationally expensive operation than the modulo for calculating the lookup index, like simply truncating to the correct number of bits. -Paul ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
