Marko Rauhamaa wrote:

> Chris Angelico <ros...@gmail.com>:
>
>>> but hash(x) == hash(y) does NOT imply that x == y.
>>
>> Hello, pigeonhole principle :) If this were false - that is, if equal
>> hashes DID imply equal objects - it would be necessary to completely
>> encode an object's state in its hash, and hashes would be impossibly
>> large. This would, in fact, destroy their value completely.
>
> Well, modern computing assumes precisely:
>
>     hash(x) == hash(y) => x == y
I really don't think that the behaviour of Python's built-in hash()
function is of fundamental concern to all of modern computing. Python's
hash() function is of fundamental concern to Python dicts, and that's
all.

> That principle is at play with strong authentication (HTTPS et al),
> version control (git et al),

git is not even written in Python. HTTPS is a protocol, so it could be
implemented in Python, but typically isn't. Neither of them uses
Python's hash(), because the people writing these applications aren't
insane. Collisions in Python's hash() are trivially easy to find, and
the existence of such collisions is well-known. (There's a quick sketch
of just how easy at the end of this post.)

Perhaps you're talking about cryptographically strong hash functions,
of which there are a few? If so, a brief note that you were changing
the subject would be appreciated :-)

The pigeonhole principle applies to crypto hash functions too, and
there is active research into creating newer and stronger hash
functions as the older ones are broken. For instance, the md4, md5,
md2 and sha-0 hash functions have all been broken, and shouldn't be
used for anything where security is of concern.

> A handful of bits uniquely identify an object regardless of its size.

No. The principle being applied is not that a handful of bits
*uniquely* identifies an object, since that is clearly and obviously
untrue: consider objects just 1024 bits in size; there are 2**1024
possible such objects. I don't know what "a handful" of bits is, but
let's say 128 bits, so there are 2**128 possible hashes. It doesn't
take much mathematics to see that you cannot *uniquely* hash 2**1024
objects into only 2**128 checksums without quite a few collisions.

The actual principle being applied is that, with a cryptographically
strong hash function, a handful of bits can *practically* identify an
object regardless of its size. Let's take that 128 bit checksum again:
that means that there are 340282366920938463463374607431768211456
possible checksums. If input files are hashed unpredictably over that
entire range, then on average you would have to generate about half
that number of hashes before stumbling, purely by chance, across a
second input with any particular checksum. Half of 2**128 is still a
rather big number: if you generated a million hashes a second, on
average it would take you considerably more than a billion years to
hit it by accident. (The birthday paradox makes finding *some* pair of
colliding inputs cheaper, roughly 2**64 hashes, but that is still an
enormous number.) For practical purposes, that's good enough. (The
arithmetic is worked through in the second sketch at the end of this
post.)

But note that there are numerous provisos there, including:

- It relies on the hash function generating a large enough handful of
  bits. An 8 bit checksum only has 2**8 = 256 unique values, so you
  would find a collision pretty quickly. (The third sketch at the end
  of this post shows just how quickly.)

- It relies on the mapping of original object to checksum being
  randomly distributed. If it is not, then there will be sets of
  objects which collide more often than expected. For instance,
  suppose that every object containing a 0 byte hashed to the same
  value. Then, on average, you would not have to wait very long before
  finding your first collision.

- It relies on the checksum being unpredictable, to prevent
  substitution attacks: you're expecting object x with checksum a, but
  somebody substitutes object y with checksum a instead.

Python's hash() function is not cryptographically strong and makes no
claim to be.

--
Steven

--
https://mail.python.org/mailman/listinfo/python-list
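For anyone curious just how easy those hash() collisions are, here is a
quick sketch. The specific values are CPython implementation details
(integer hashes are reduced modulo sys.hash_info.modulus, which is
2**61 - 1 on typical 64-bit builds, and -1 is reserved as an internal
error code), so treat the collisions below as illustrative rather than
guaranteed by the language:

    import sys
    import hashlib

    # Integer hashes wrap around at sys.hash_info.modulus, so a
    # collision can be manufactured on demand.
    m = sys.hash_info.modulus
    print(hash(1) == hash(1 + m))    # True: deliberate collision
    print(hash(-1) == hash(-2))      # True on CPython: -1 is reserved
                                     # internally as an error code

    # A cryptographically strong hash is a different beast entirely:
    # as far as is publicly known, nobody can manufacture a SHA-256
    # collision on purpose.
    print(hashlib.sha256(b"spam").hexdigest())
    print(hashlib.sha256(b"eggs").hexdigest())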
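The "considerably more than a billion years" figure above is just
back-of-the-envelope arithmetic, assuming the million-hashes-a-second
rate and a 128 bit checksum; something along these lines:

    # Brute-forcing one particular 128-bit checksum at a million hashes
    # per second, searching about half the space on average.
    hashes_needed = 2**128 // 2
    rate = 10**6                          # hashes per second
    seconds_per_year = 60 * 60 * 24 * 365
    print(hashes_needed // (rate * seconds_per_year))  # ~5.4e24 years

    # The birthday paradox makes finding *some* colliding pair much
    # cheaper, roughly 2**64 hashes, but even that is a long wait:
    print(2**64 // (rate * seconds_per_year))          # ~585,000 years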
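And a quick throwaway simulation of the first proviso: with only an
8 bit checksum there are just 256 possible values, so (as the birthday
paradox predicts) a random collision turns up after around twenty
hashes on average. The helper function here is purely illustrative:

    import random

    def draws_until_collision(n_values=256):
        """Draw random 'checksums' until one repeats; return the count."""
        seen = set()
        count = 0
        while True:
            value = random.randrange(n_values)
            count += 1
            if value in seen:
                return count
            seen.add(value)

    trials = [draws_until_collision() for _ in range(10000)]
    print(sum(trials) / len(trials))   # comes out around 20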