Tim Peters <t...@python.org> added the comment:

Noting that the failure of high-order bits to propagate is one form of 
"avalanche" failure.  A recent 64-bit hash, SeaHash, takes this approach which 
is said to have provably "perfect" avalanche behavior:

    Py_uhash_t t = (Py_uhash_t)y;
    t *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;  /* propagate low-order bits left */
    t ^= (t >> 32) >> (t >> 60);             /* high bits fold into low bits */
    t *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;  /* propagate altered low bits left */

The giant constant is just a 63-bit "patternless" prime (for whatever reason, 
the author wants this transformation to be easily invertible, so a prime is 
necessary - this is NOT a "crypto" hash).  The first multiply propagates 
low-order bits left.  Then the next line uses high-order bits to change 
low-order bits.  Extracting a variable shift count from the data itself is a 
clever idea taken from the PCG family of PRNGs - you have to work to contrive 
data where this doesn't "act kinda random".  The last line then propagates the 
lower-order bits (now altered by the high-order bits) left again.

Followed by

    x = (x * mult) + t;

this yields a tuple hash that passes all the tests I have.  The only collisions 
are in the new tuple test, which suffers 14.

Better, add the "catastrophic" right-rotate

    t = (t >> 3) | (t << 61);

at the start, and it's still only the new tuple test that suffers collisions - 
the count rises to 19 then, close to (but still under) its failure cutoff.

What I haven't tried:  in context it would be nice to eliminate the second 
multiply by the giant constant.  We're doing a multiply anyway to fold `t` into 
`x`, which will propagate the low-order bits left on the next iteration's `x * 
mult`.  That would ruin SeaHash's provable guarantees, but I'm more concerned 
about saving some cycles than purity ;-)  If that added enough collisions to 
push the second tuple test "not much" over the limit, I'd raise its limit.

Gonzo:  "the real" SeaHash duplicates the code above and works on 4 inputs 
simultaneously, designed to keep a modern processor's instruction pipelines as 
busy as possible.  I'd rather not bother (most tuples are short).

So this is principled and, if the SeaHash theory is right, immune to any simple 
kind of catastrophic failure.  Is it worth it?  I'd sleep better, yes ;-)  Note 
that the first multiply by the giant constant can be active at the same time as 
the `x * mult` at the end, so I'm guessing the speed hit would be bearable.

There's no truly _cheap_ way to get good propagation from all bit positions.  
SeaHash has the fastest way to do that I've encountered.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34751>
_______________________________________