[issue34751] Hash collisions for tuples

Tim Peters Sat, 22 Sep 2018 12:27:01 -0700


Tim Peters <t...@python.org> added the comment:


So you don't know of any directly relevant research either.  "Offhand I can't 
see anything wrong" is better than nothing, but very far from "and we know it 
will be OK because [see references 1 and 2]".

That Bernstein's DJBX33A has been widely used inspires no confidence 
whatsoever.  As already explained, Python _did_ use DJBX33X (with a different 
multiplier), and it systematically screwed up catastrophically in some nested 
tuple cases already spelled out.  Bernstein himself moved to DJBX33X (using xor 
instead of addition), and that would have screwed up too on a mix of smallish 
integers of both signs.

What Python does now, except for changing the multiplier, is essentially 
version 1a of the _also_ very widely used - but more recent - Fowler/Noll/Vo 
hash family[1]:

    # Version 1a, recommended over version 1 (which does * before ^).
    hash = offset_basis
    for each octet_of_data to be hashed
        hash = hash xor octet_of_data
        hash = hash * FNV_prime
    return hash

They did extensive studies, and very deliberately use a prime multiplier 
subject to a number of other constraints.  Not based on "offhand I can't see 
why not", but based on analysis and running extensive tests.

But, same as with Bernstein's variants (which predate FNV's work on picking 
"good" multipliers), they were developed in the context of hashing sequences of 
unsigned 8-bit integers.  There's no a priori reason I can see to expect them 
to work well in contexts other than that.  Indeed, FNV puts special constraints 
on the last 8 bits of the primes they pick, presumably because they're only 
concerned with hashing sequences of 8-bit values.

FYI, for 32-bit hashes they use multiplier 16777619, and for 64-bit 
1099511628211.

In the absence of coherent analysis directly relevant to what Python actually 
needs here, we're all flying blind.  What we do have is over a decade of 
positive real-world experience with the made-up hacks Python used to worm 
around a class of gross flaws its prior DJBX33X approach suffered, taking 
DJBX33X out of its original context and applying it in an area it wasn't 
designed for.  Which made it close to FNV-1a, but also in a context _that_ 
wasn't designed for.

However, it _is_ structurally close to FNV-1a, and using prime multipliers was 
a big deal to them.  "But offhand I don't see why" isn't a good enough reason 
to abandon that.  To the contrary, in the absence of _proof_ that it doesn't 
actually matter, I'd be happiest using the same multipliers (given above) and 
initial values "the standard" FNV-1a algorithms use.

[1] http://www.isthe.com/chongo/tech/comp/fnv/index.html#FNV-1a

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34751>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue34751] Hash collisions for tuples

Reply via email to