[issue30671] dict: improve lookup function

Tim Peters Mon, 19 Jun 2017 11:19:58 -0700

Tim Peters added the comment:

BTW, I should have spelled this out earlier:  recall that x and y collide on 
the first probe if and only if


    x = y (mod 2**k)   [1]

The second probe never happens unless they do collide on the first probe, so 
when looking at the second probe we can assume that [1] is true.  Since [1] is 
true,

    5*(x % 2**k) + 1 = 5*(y % 2**k) + 1

is also true (and there's nothing special here about "5" and "1" - it would 
hold for any multiplier and any addend).  The second probe just adds (x >> 5) % 
2**k to the left side of that and (y >> 5) % 2**k to the right side of that.  
Therefore the second probes collide if and only if (in addition to [1])

    x >> 5 = y >> 5 (mod 2**k)  [2]

The primary point is that any analysis that ignores the higher-order bits is 
missing the primary thing that matters ;-)

[2] does reveal some potential weaknesses in the scheme.  For example, if k < 
5, some bits of the hash code have no effect on the probe sequence (e.g., if 
k==1, the first probe depends only on the hash code's last bit, and if there's 
collision then a second-probe collision depends only on bit 2**5 -- bits 2**1 
through 2**4 are effectively ignored - but the table only has 2 slots in the 
k==1 case).  Decreasing PERTURB_SHIFT works against that.

OTOH, if k > 5, some of the bits that contributed to [1] being true also feed 
into whether [2] is true.  Increasing PERTURN_SHIFT works against that.

Setting PERTURB_SHIFT=k (i.e., make it depend on the number of slots) avoids 
both, but dosen't appear to actually perform better than always using 5.

The rub is that collisions just aren't a real problem under the current scheme 
- and hashsim.py shows it's doing about as well on that count as the 
theoretical gold standard ("uniform hashing").

If people have time to spend on dicts, I'd rather see them pursue entirely 
different methods, that guarantee to slash the _worst_ case number of probes.  
hashsim.py also shows that while long collision chains are rare, they do occur 
(and they also occur under "uniform hashing").  Then, e.g., we could drop the 
annoying "hash randomization" cruft.  For example, dig into "cuckoo hashing" 
and "Robin Hood hashing".  Just be warned that, regardless of scheme, the 
devil's in the details :-(

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30671>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue30671] dict: improve lookup function

Reply via email to