Dmitry Rubanovich added the comment:

A few notes on my motivation and my other comments (not necessarily in 
order of significance):

1. I started looking at this because of the fix in Issue28201.  The 
improvement it makes shows up as the difference between

print_collision_counts(previous_lookdict_perturb)

and 

print_collision_counts(py3_6_1_lookdict_perturb)

because the pre-28201 code was, in effect, multiplying by 6 instead of 5 
and therefore always had a 1/2 chance of two first secondary hashes 
colliding.
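
In outline, the calculation looks like this.  This is a minimal sketch, 
assuming PERTURB_SHIFT = 5 and the probe recurrence i = i*5 + perturb + 1 
from lookdict; the names mirror my script, but the bodies here are 
reconstructions, and the attached script is the authoritative version:

    from collections import Counter

    PERTURB_SHIFT = 5

    def previous_lookdict_perturb(h, mask):
        # Pre-28201, the first probe used the not-yet-shifted perturb
        # (still the full hash), so i*5 + perturb + 1 collapses to
        # (6*h + 1) & mask.  Multiplying by 6 = 2*3 is a 2-to-1 map
        # modulo a power of two, hence the constant 1/2 collision rate.
        return (h * 5 + h + 1) & mask

    def py3_6_1_lookdict_perturb(h, mask):
        # Post-28201, perturb is shifted before it is mixed in.
        return (h * 5 + (h >> PERTURB_SHIFT) + 1) & mask

    def print_collision_counts(first_probe):
        # An exact enumeration, not a simulation: every h in range(size)
        # is a distinct primary hash; count how many land on a first
        # secondary index already taken by another primary hash.
        for power in range(3, 21):            # table sizes 8 .. 2**20
            size = 2 ** power
            hits = Counter(first_probe(h, size - 1) for h in range(size))
            collided = sum(c - 1 for c in hits.values() if c > 1)
            print("size %8d: %7d collisions" % (size, collided))

    print_collision_counts(previous_lookdict_perturb)  # size/2 everywhere
    print_collision_counts(py3_6_1_lookdict_perturb)   # 0 until size 64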

2. What I demonstrated was that the fix, as it stands, doesn't do as much 
magic as we would hope: it produces a scheme in which **first secondary** 
hashes can still collide at a rate as high as 1/3.

3. I somewhat regret extending the script to show what happens to the 
2nd, 3rd, etc. secondary hashes, because it created the perception that I 
was trying to present a simulation.  I wasn't.  I was presenting a 
calculation; more precisely, one calculation for each dictionary size up 
to 2**20.

4. I detect apathy towards the fix, but I get the sense that it's mostly 
because I initially didn't realize that I should have limited the issue 
squarely to the 1st secondary hash.  And after I made the initial error 
in the calculation (as pointed out by Inada Naoki), which would have 
produced an unstable calculation of the 2nd, 3rd, 4th, etc. secondary 
hashes (as pointed out by Tim), there has been a lot of pushback.

But the fix I propose DOES improve the ***1st secondary hash*** because 
it eliminates any possibility of collision there.  That is to say, no two 
1st secondary hashes will be the same for any two different primary 
hashes.

And in the latest iteration of this proposal I did limit the scope of the 
enhancement to the 1st secondary hash only.  As it happens, such 
collisions do not occur at all for dictionaries of sizes 8, 16, and 32; 
they only become possible for dictionaries of size 64 or greater.  The 
latest script demonstrates this, and a quick re-check is sketched below.
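
A quick way to re-check that claim, reusing py3_6_1_lookdict_perturb from 
the sketch above (smallest_colliding_size is just an illustrative helper, 
not something from the attached script):

    def smallest_colliding_size(first_probe, max_power=20):
        # Smallest table size at which two distinct primary hashes
        # share a first secondary index, or None if there is none.
        for power in range(3, max_power + 1):
            size = 2 ** power
            seen = set()
            for h in range(size):
                probe = first_probe(h, size - 1)
                if probe in seen:
                    return size
                seen.add(probe)
        return None

    print(smallest_colliding_size(py3_6_1_lookdict_perturb))  # 64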

5.  Sorry, Tim, but I should have restricted my statements to the one 
calculation I was certain about.  The further comments about what happens 
to later secondary hashes were speculation on my part, and this 
calculation does not reveal much about them.  If you want to consider 
what happens with later secondary indexes (as you do in hashsim.py), 
that's a different problem.  To reiterate, my intention was to further 
improve on the fix in 28201.


On the size of perturb:

6.  Since the default hashes of objects are derived from their pointers, 
and those pointers can safely be assumed to address the heap, they are 
aligned on a 2 * sizeof(void*) boundary.  That's 16 in 64-bit builds, so 
the last 4 bits are 0.  And (depending on how much heap is used) the 
higher bits of the pointers are largely identical in applications that 
don't use a lot of memory.

For example, in an application that uses 16 MiB of heap, only bits 4 
through 23 (roughly 20 bits in the middle of the word) are truly random.  
The remaining bits can be assumed not to change, and since PERTURB_SHIFT 
is 5, the loop exhausts the random bits by about its 5th iteration, after 
which perturb contributes nothing new.  So this is already a mildly 
degenerate case for a workload whose dictionary keys mostly hash this 
way.
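
A toy illustration of that arithmetic (the base address below is made up; 
only the 16-byte alignment and the 16 MiB extent matter):

    BASE = 0x7F3A00000000        # hypothetical base of a 16 MiB heap
    lo = BASE                    # first 16-byte-aligned address in it
    hi = BASE + (1 << 24) - 16   # last 16-byte-aligned address in it
    print(hex(lo ^ hi))          # 0xfffff0: only bits 4..23 can differ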

I haven't given much thought to which bits will have the most entropy in 
general, but that's going too far off topic.

----------
type: enhancement -> performance
Added file: http://bugs.python.org/file46961/1st_secondary_collision_counts.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30671>
_______________________________________