[issue34751] Hash collisions for tuples

Tim Peters Wed, 03 Oct 2018 12:40:59 -0700


Tim Peters <t...@python.org> added the comment:


Thanks for looking at xxHash!  An advantage is that it already comes in 32- and 
64-bit versions.

> A (simplified and slightly modified version of) xxHash
> seems to work very well, much better than SeaHash.

?  I've posted several SeaHash cores that suffer no collisions at all in any of 
our tests (including across every "bad example" in these 100+ messages), except 
for "the new" tuple test.  Which it also passed, most recently with 7 
collisions.  That was under 64-bit builds, though, and from what follows I 
figure you're only looking at 32-bit builds for now.

>  Just like SeaHash, xxHash also works in parallel. But I'm not
> doing that and just using this for the loop:
>
>    for y in t:
>        y ^= y * (PRIME32_2 - 1)
>        acc += y
>        acc = ((acc << 13) + (acc >> 19))  # rotate left by 13 bits
>        acc *= MULTIPLIER
>
> Plain xxHash does "y *= PRIME32_2" or equivalently
> "y += y * (PRIME32_2 - 1)". Replacing that += by ^= helps
> slightly with my new tuple test.

Taking an algorithm in wide use that's already known to get a top score on 
SMHasher and fiddling it to make a "slight" improvement in one tiny Python test 
doesn't make sense to me.  Does the variant also score well under SMHasher?  "I 
don't see why it wouldn't" is not an answer to that ;-)

Note that the change also slows the loop a bit - "acc += y" can't begin until 
the multiply finishes, and the following "acc *=" can't begin until the 
addition finishes.  Lengthening the critical path needs to buy something real 
to justify the extra expense.  I don't see it here.

For posterity:  the 64-bit version of xxHash uses different primes, and rotates 
left by 31 instead of by 13.

Note:  I'm assuming that by "PRIME32_2" you mean 2246822519U, and that 
"MULTIPLIER" means 2654435761U.  Which appear to be the actual multipliers used 
in, e.g.,

https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp

As always, I'm not a fan of making gratuitous changes.  In any case, without 
knowing the actual values you're using, nobody can replicate what you're doing.

That said, I have no reason to favor SeaHash over xxHash, and xxHash also has a 
real advantage in that it apparently didn't _need_ the "t ^= t << 1" 
permutation to clear the new tuple test's masses of replicated sign bits.

But the more we make changes to either, the more work _we_ have to do to ensure 
we haven't introduced subtle weaknesses.  Which the Python test suite will 
never be substantial enough to verify - we're testing almost nothing about hash 
behavior.  Nor should we:  people already wrote substantial test suites 
dedicated to that sole purpose, and we should aim to be "mere consumers" of 
functions that pass _those_ tests.  Python's test suite is more to ensure that 
known problems _in Python_ don't recur over the decades.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34751>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue34751] Hash collisions for tuples

Reply via email to