Marko Rauhamaa wrote:

> Chris Angelico <ros...@gmail.com>:
>
>>> but hash(x) == hash(y) does NOT imply that x == y.
>>
>> Hello, pigeonhole principle :) If this were false - that is, if equal
>> hashes DID imply equal objects - it would be necessary to completely
>> encode an object's state in its hash, and hashes would be impossibly
>> large. This would, in fact, destroy their value completely.
>
> Well, modern computing assumes precisely:
>
>     hash(x) == hash(y) => x == y
I really don't think that the behaviour of Python's built-in hash()
function is of fundamental concern to all of modern computing. Python's
hash() function is of fundamental concern to Python dicts, and that's
all.

> That principle is at play with strong authentication (HTTPS et al),
> version control (git et al),

git is not even written in Python. HTTPS is a protocol, so it could be
implemented in Python, but typically isn't. Neither of them uses
Python's hash(), because the people writing these applications aren't
insane. Collisions in Python's hash() are trivially easy to find, and
the existence of such collisions is well-known. (There's a quick sketch
of just how easy at the end of this post.)

Perhaps you're talking about cryptographically strong hash functions,
of which there are a few? If so, a brief note that you were changing
the subject would be appreciated :-)

The pigeonhole principle applies to crypto hash functions too, and
there is active research into creating newer and stronger hash
functions as the older ones are broken. For instance, the md4, md5,
md2 and sha-0 hash functions have all been broken, and shouldn't be
used for anything where security is of concern.

> A handful of bits uniquely identify an object regardless of its size.

No. The principle being applied is not that a handful of bits
*uniquely* identifies an object, since that is clearly and obviously
untrue: consider objects just 1024 bits in size; there are 2**1024
possible such objects. I don't know what "a handful" of bits is, but
let's say 128 bits, so there are 2**128 possible hashes. It doesn't
take much mathematics to see that you cannot *uniquely* hash 2**1024
objects into only 2**128 checksums without quite a few collisions.

The actual principle being applied is that, with a cryptographically
strong hash function, a handful of bits can *practically* identify an
object regardless of its size. Let's take that 128 bit checksum again:
that means that there are 340282366920938463463374607431768211456
possible checksums. If input files are hashed unpredictably over that
entire range, then on average you would have to generate about half
that number of hashes before stumbling, purely by chance, across a
second input with any particular checksum. Half of 2**128 is still a
rather big number: if you generated a million hashes a second, on
average it would take you considerably more than a billion years to
hit it by accident. (The birthday paradox makes finding *some* pair of
colliding inputs cheaper, roughly 2**64 hashes, but that is still an
enormous number.) For practical purposes, that's good enough. (The
arithmetic is worked through in the second sketch at the end of this
post.)

But note that there are numerous provisos there, including:

- It relies on the hash function generating a large enough handful of
  bits. An 8 bit checksum only has 2**8 = 256 unique values, so you
  would find a collision pretty quickly. (The third sketch at the end
  of this post shows just how quickly.)

- It relies on the mapping of original object to checksum being
  randomly distributed. If it is not, then there will be sets of
  objects which collide more often than expected. For instance,
  suppose that every object containing a 0 byte hashed to the same
  value. Then, on average, you would not have to wait very long before
  finding your first collision.

- It relies on the checksum being unpredictable, to prevent
  substitution attacks: you're expecting object x with checksum a, but
  somebody substitutes object y with checksum a instead.

Python's hash() function is not cryptographically strong and makes no
claim to be.

--
Steven

--
https://mail.python.org/mailman/listinfo/python-list
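For anyone curious just how easy those hash() collisions are, here is a
quick sketch. The specific values are CPython implementation details
(integer hashes are reduced modulo sys.hash_info.modulus, which is
2**61 - 1 on typical 64-bit builds, and -1 is reserved as an internal
error code), so treat the collisions below as illustrative rather than
guaranteed by the language:

    import sys
    import hashlib

    # Integer hashes wrap around at sys.hash_info.modulus, so a
    # collision can be manufactured on demand.
    m = sys.hash_info.modulus
    print(hash(1) == hash(1 + m))    # True: deliberate collision
    print(hash(-1) == hash(-2))      # True on CPython: -1 is reserved
                                     # internally as an error code

    # A cryptographically strong hash is a different beast entirely:
    # as far as is publicly known, nobody can manufacture a SHA-256
    # collision on purpose.
    print(hashlib.sha256(b"spam").hexdigest())
    print(hashlib.sha256(b"eggs").hexdigest())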
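The "considerably more than a billion years" figure above is just
back-of-the-envelope arithmetic, assuming the million-hashes-a-second
rate and a 128 bit checksum; something along these lines:

    # Brute-forcing one particular 128-bit checksum at a million hashes
    # per second, searching about half the space on average.
    hashes_needed = 2**128 // 2
    rate = 10**6                          # hashes per second
    seconds_per_year = 60 * 60 * 24 * 365
    print(hashes_needed // (rate * seconds_per_year))  # ~5.4e24 years

    # The birthday paradox makes finding *some* colliding pair much
    # cheaper, roughly 2**64 hashes, but even that is a long wait:
    print(2**64 // (rate * seconds_per_year))          # ~585,000 years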
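And a quick throwaway simulation of the first proviso: with only an
8 bit checksum there are just 256 possible values, so (as the birthday
paradox predicts) a random collision turns up after around twenty
hashes on average. The helper function here is purely illustrative:

    import random

    def draws_until_collision(n_values=256):
        """Draw random 'checksums' until one repeats; return the count."""
        seen = set()
        count = 0
        while True:
            value = random.randrange(n_values)
            count += 1
            if value in seen:
                return count
            seen.add(value)

    trials = [draws_until_collision() for _ in range(10000)]
    print(sum(trials) / len(trials))   # comes out around 20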