[issue34751] Hash collisions for tuples

Tim Peters Tue, 25 Sep 2018 09:54:52 -0700


Tim Peters <t...@python.org> added the comment:


> Suppose that there is a hash collision, say hash((3, 3)) ==
> hash((-3, -3)) and you change the hashing algorithm to fix
> this collision.

There are _two_ hash functions at play in that collision:  the tuple hash 
function, and the integer hash function.  Regardless of whether the _tuple_ 
hash function does

    t ^= t << shift;
or
    t += t >> shift;

or anything else involving just `t`, that only directly affects the result of 
the _int_ hash function.  Whose low order bits cannot be improved in general - 
they're already optimally spread out in the most important cases.

> If that change does NOT affect the lower N bits,
> then you still have a collision hash((3, 3)) % 2**N ==
> hash((-3, -3)) % 2**N. This is relevant because the dict
> implementation starts by taking the lower bits first. So
> ideally we want to avoid hash collisions in the lower
> bits too.

Which the int hash on its own already does a superb job of doing in the most 
important cases.  If you feel you just have to mess with low-order bits, do it 
instead on the _tuple_ hash intermediate results, not on its inputs.  Like

    x += x >> 16;

after t is folded in to x.  It's x you care about directly, not t.  Damaging 
desirable properties in t to _indirectly_ gain something in x is too clever by 
half.

Mucking with t to avoid the nested-tuple catastrophes is worth it, but I prefer 
to do that with as little damage to t's desirable properties as reasonably 
possible.

> This may also be a reason to avoid the standard FNV
> multipliers: the 64-bit FNV multiplier was chosen with
> the full 64-bit hash range in mind. It was never meant
> to work well when truncated to some lower N bits.

Do you believe any other multiplier would work better toward that end?  For any 
odd multiplier M, the last k bits of i*M are determined solely by the last k 
bits of i, and

    [(last k bits of i*M) for i in range(anything, anything + 2**k)]

is a permutation of range(2**k).  The high order bits of i have nothing to do 
with it, and any odd multiplier permutes the lower k bits over any stretch of 
2**k consecutive multiplicands.

Extracting low-order bits is intended to be done by "xor folding" in FNV, and I 
_expect_ the same would be prudent for any other multiplier:

https://tools.ietf.org/html/draft-eastlake-fnv-15#page-7

The dict and set lookup functions already do something stronger than xor 
folding to bring high bits into play, but do so incrementally as the probe 
sequence unfolds.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34751>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue34751] Hash collisions for tuples

Reply via email to