Marc-Andre Lemburg <m...@egenix.com> added the comment:

Before continuing down the road of adding randomness to hash functions, please have a good read of the existing dictionary implementation:
""" Major subtleties ahead: Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't: its most important hash functions (for strings and ints) are very regular in common cases: [0, 1, 2, 3] >>> map(hash, ("namea", "nameb", "namec", "named")) [-1658398457, -1658398460, -1658398459, -1658398462] >>> This isn't necessarily bad! To the contrary, in a table of size 2**i, taking the low-order i bits as the initial table index is extremely fast, and there are no collisions at all for dicts indexed by a contiguous range of ints. The same is approximately true when keys are "consecutive" strings. So this gives better-than-random behavior in common cases, and that's very desirable. ... """ There's also a file called dictnotes.txt which has more interesting details about how the implementation is designed. Please note that the term "collision" is used in a slightly different way: it refers to trying to find an empty slot in the dictionary table. Having a collision implies that the hash values of two distinct objects are the same, but you also get collisions in case two distinct objects with different hash values get mapped to the same table entry. An attack can be based on trying to find many objects with the same hash value, or trying to find many objects that, as they get inserted into a dictionary, very often cause collisions due to the collision resolution algorithm not finding a free slot. In both cases, the (slow) object comparisons needed to find an empty slot is what makes the attack practical, if the application puts too much trust into large blobs of input data - which is the actual security issues we're trying to work around here... Given the dictionary implementation notes, I'm even less certain that the randomization change is a good idea. It will likely introduce a performance hit due to both the added complexity in calculating the hash as well as the reduced cache locality of the data in the dict table. I'll upload a patch that demonstrates the collisions counting strategy to show that detecting the problem is easy. Whether just raising an exception is a good idea, is another issue. It may be better to change the tp_hash slot in Python 3.3 to take an argument, so that the dict implementation can use the hash function as universal hash family function (see http://en.wikipedia.org/wiki/Universal_hash). The dict implementation could then alter the hash parameter and recreate the dict table in case the number of collisions exceeds a certain limit, thereby actively taking action instead of just relying on randomness solving the issue in most cases. ---------- _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue13703> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com