[issue29304] dict: simplify lookup function

2017-06-14 Thread Dmitry Rubanovich

Dmitry Rubanovich added the comment:

lookdict_index() (and the rest of the lookup functions in dictobject.c) uses
an unnecessarily complicated perturb mechanism.  And, in fact, it's slower
than the simpler case.

Instead of this:

for (size_t perturb = hash;;) {
    perturb >>= PERTURB_SHIFT;
    i = mask & ((i << 2) + i + perturb + 1);


it should do this:

for (size_t perturb = hash;;) {
    i = mask & ((i << 1) + perturb + 1);
    perturb >>= PERTURB_SHIFT;


This would not only save an instruction (a minor issue), but it would also 
reduce collisions.

I've attached a file which calculates collision frequencies for
demonstration purposes.  It shows that the calculation, as it stands right
now, does not create a 1-1 mapping even on the 1st iteration through the
loop.  Merely moving the PERTURB_SHIFT line to after the calculation reduces
the density of the collision space.  But the calculation I proposed
eliminates collisions on the 1st iteration completely and reduces them on
most subsequent iterations.
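
For illustration, a minimal sketch of the kind of first-probe collision
count the attached script performs (names and structure are mine, not the
attachment's):

# Sketch only; collisions_count.py is the authoritative version.
PERTURB_SHIFT = 5

def first_probe_3_6_1(h, mask):
    # current code: shift first, then i = i*5 + perturb + 1
    perturb = h >> PERTURB_SHIFT
    return (5 * (h & mask) + perturb + 1) & mask

def first_probe_proposed(h, mask):
    # proposed code: i = i*2 + perturb + 1, with perturb still equal to hash
    return (2 * (h & mask) + h + 1) & mask

def lost_slots(first_probe, k):
    # slots never hit when every primary index in the 2**k ring is mapped
    # through the first probe exactly once; 0 means a 1-1 mapping
    mask = 2**k - 1
    return 2**k - len({first_probe(h, mask) for h in range(2**k)})

for k in range(8, 13):
    print(2**k, lost_slots(first_probe_3_6_1, k),
          lost_slots(first_probe_proposed, k))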

--
components: +Interpreter Core
nosy: +Dmitry Rubanovich
type:  -> enhancement
versions: +Python 3.7
Added file: http://bugs.python.org/file46952/collisions_count.py


[issue30671] dict: simplify and improve lookup function

2017-06-15 Thread Dmitry Rubanovich

New submission from Dmitry Rubanovich:

lookdict_index() (and the rest of the lookup functions in dictobject.c) uses
an unnecessarily complicated perturb mechanism.  And, in fact, it's slower
than the simpler case.

Instead of this:

for (size_t perturb = hash;;) {
    perturb >>= PERTURB_SHIFT;
    i = mask & ((i << 2) + i + perturb + 1);


it should do this:

for (size_t perturb = hash;;) {
    i = mask & ((i << 1) + perturb + 1);
    perturb >>= PERTURB_SHIFT;


This would not only save an instruction (a minor issue), but it would also 
reduce collisions.

I've attached a file which calculates collision frequencies for
demonstration purposes.  It shows that the calculation, as it stands right
now, does not create a 1-1 mapping even on the 1st iteration through the
loop.  Merely moving the PERTURB_SHIFT line to after the calculation reduces
the density of the collision space.  But the calculation I proposed
eliminates collisions on the 1st iteration completely and reduces them on
most subsequent iterations.

--
components: Interpreter Core
messages: 296072
nosy: Dmitry Rubanovich, inada.naoki, rhettinger, serhiy.storchaka, xiang.zhang
priority: normal
severity: normal
status: open
title: dict: simplify and improve lookup function
type: enhancement
versions: Python 3.7
Added file: http://bugs.python.org/file46953/collisions_count.py


[issue30671] dict: simplify and improve lookup function

2017-06-15 Thread Dmitry Rubanovich

Dmitry Rubanovich added the comment:

I am looking at the 3.6.1 code.  I am not trying to simulate the index lookup
as the lookup function would do it.  There is no point in that.

The point of the script is to detect collisions among perturbed secondary
indexes rather than between primary hashes and secondary ones.  To that end,
it applies the perturbed secondary hash function to **all** primary indexes
and stores the results.  This makes it possible to judge the quality of the
secondary hash function.  The "old" one was bad because it automatically
created collisions by multiplying by zero-divisors of the 2^k ring.  The one
used in 3.6.1 is better because it doesn't always do that, but it still
multiplies by zero-divisors of the ring when h mod 2^k is equal to
(h >> 5) mod 2^k.  This multiplication by zero-divisors of the 2^k ring is
what produces collisions.  The script simply counts them.
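
As a concrete illustration of the zero-divisor point (a toy example of mine,
not part of the script): an even multiplier collapses the ring, while an odd
multiplier permutes it.

# Even constants are zero-divisors mod 2**k, so multiplying by them cannot
# be injective; odd constants are units and permute the ring.
k = 8
print(len({(6 * x) % 2**k for x in range(2**k)}))  # 128: half the slots lost
print(len({(5 * x) % 2**k for x in range(2**k)}))  # 256: a permutation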

The later iterations of the loop are not very relevant.  They will almost 
certainly produce further collisions, but that's unavoidable.  

By the way, just as a test, I changed the value of PERTURB_SHIFT to 1 and, in
my proposed solution, it produced lower collision counts after 3 runs of the
loop in virtually all rings (i.e., mod all values of 2^k for k from 8 to 20).
Intuitively it makes sense because there is less information loss, but I
don't have a formal proof as to why that should be.

--


[issue30671] dict: simplify and improve lookup function

2017-06-15 Thread Dmitry Rubanovich

Dmitry Rubanovich added the comment:

Yes, I do understand it.  

But the statement in lines 166, 167, "For any initial j in range(2**i),
repeating that 2**i times generates each int in range(2**i) exactly once",
does not hold once the perturb is added.  2**i steps will not be enough to
generate all elements of the ring when some of the multipliers are
zero-divisors.
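
For reference, the quoted property is easy to check for the unperturbed
recurrence; a minimal sketch:

# dictobject.c's claim for the perturb-free recurrence: starting from any j,
# j = (5*j + 1) % 2**i visits each value in range(2**i) exactly once before
# repeating (a full-period linear congruential step: 5 % 4 == 1, and 1 is odd).
i = 8
j, seen = 0, set()
for _ in range(2**i):
    seen.add(j)
    j = (5 * j + 1) % 2**i
assert len(seen) == 2**i  # all 256 values visited, none twice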

Specifically, if you use j = ((5*j) + P + 1) mod 2**i, you are effectively
multiplying by a zero-divisor every time P = j mod 2.  And I don't mean
"mod 2**i"; I do mean "mod 2".  Which means any time P (which changes a few
times and eventually becomes 0) has the same parity as j, you are multiplying
by a zero-divisor.  Because P is eventually 0, you will go through all the
values **eventually**.  But for all values of P for which there is a P' such
that P != P' and ((5*j) + P + 1) = ((5*j) + P' + 1) mod 2**i, the number of
times you'll need to apply the function will be higher than 2**i.  And that's
the worst-case scenario.

In the best-case scenario, the collision probability can be read off the
values my script prints via print_collision_counts(py3_6_1_lookdict_perturb).
It can be as low as ~1/20 and as high as ~1/3, depending on the value of i.

The main speed-up we are seeking is to avoid a collision early on; we are
less concerned about collisions later.  Anecdotally, if your dict has 256
buckets, the chance of a collision is 1 in ~3.5.  That's an improvement over
1 in 2, but still pretty high.

OK, how about this: to avoid the edge cases, unroll the 1st secondary hash
to use j = 2*j + P + 1.  That is, test it once before the loop, but leave
5*j + P + 1 in the loop as is.

Although, to be honest, if PERTURB_SHIFT is changed to 1 and we have a dict
with a key that causes 64 collisions, that may be a good time to resize even
if we are still sparse.  On the other hand, this might create an attack
vector: some magic value which causes resizes of near-empty dicts.  So maybe
not...  Certainly not without further analysis.

BTW, and it's mostly stylistic, but except for the "*hashpos = i & mask;"
line in find_empty_slot(), applying the mask can be moved into
dk_get_index().  Again, I am looking at the released 3.6.1 code, so if this
is already done, then never mind.

As another note, changing PERTURB_SHIFT to 1 causes a near-universal reduction 
in collisions (for all strategies).  So if nothing else, that's probably an 
improvement.

--
Added file: http://bugs.python.org/file46956/fewer_collisions_count.py


[issue30671] dict: simplify and improve lookup function

2017-06-15 Thread Dmitry Rubanovich

Dmitry Rubanovich added the comment:

Tim,

I am not testing randomly.  The scripts calculate a secondary hash for
**each** value in a ring to see how often this results in duplicate (or
triplicate, etc.) values.  And they do it for each (mod 2**i) ring for i
between 8 and 20.

Then (mostly as an afterthought) the scripts calculate how often each scheme
results in a collision when more than 1 secondary hash index is calculated.
I used 3 secondary indexes as a demonstration of the "afterthought" part.

I do understand that the logic relies on being able to reach each value in
0..2**i - 1 in the worst case.  Isn't that accomplished by my latest
proposal, though?  It was this:

"...unroll the 1st secondary hash to use j = 2*j + P + 1.  That is, test it
once before the loop, but leave 5*j + P + 1 in the loop as is."

In the code, that would mean changing the loop in, for example,
lookdict_index() from

for (size_t perturb = hash;;) {
    perturb >>= PERTURB_SHIFT;
    i = mask & ((i << 2) + i + perturb + 1);
    ix = dk_get_index(k, i);
    if (ix == index) {
        return i;
    }
    if (ix == DKIX_EMPTY) {
        return DKIX_EMPTY;
    }
}

to this:

size_t perturb = hash;

i = mask & ((i << 1) + perturb + 1);    /* <-- key line */
ix = dk_get_index(k, i);
if (ix == index) {
    return i;
}
if (ix == DKIX_EMPTY) {
    return DKIX_EMPTY;
}

for (;;) {
    /* nothing changes in this loop */
    perturb >>= PERTURB_SHIFT;
    i = mask & ((i << 2) + i + perturb + 1);
    ix = dk_get_index(k, i);
    if (ix == index) {
        return i;
    }
    if (ix == DKIX_EMPTY) {
        return DKIX_EMPTY;
    }
}
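
To check that the unrolled variant still reaches every slot eventually (the
property the loop must preserve), here is a small Python model of the probe
order above; the modeling is mine, not part of the attached scripts.

import itertools

PERTURB_SHIFT = 5

def probe_order(h, k):
    # model (not CPython code) of the unrolled lookup: initial slot, the
    # 2*i + perturb + 1 precheck, then the unchanged 5*i + perturb + 1 loop
    mask = 2**k - 1
    i, perturb = h & mask, h
    yield i
    i = ((i << 1) + perturb + 1) & mask
    yield i
    while True:
        perturb >>= PERTURB_SHIFT
        i = ((i << 2) + i + perturb + 1) & mask
        yield i

k = 8
for h in (0, 12345, 2**60 + 7):
    seen = set(itertools.islice(probe_order(h, k), 4 * 2**k))
    assert len(seen) == 2**k  # every slot is eventually probed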

And, of course, it would mean adding the same precheck in front of all loops 
which go through this secondary index calculation.

This prevents preventable collisions for the hashes h such that

h mod 2**k is equal to (h >> 5) mod 2**k,

where 2**k is the dict size.  The frequency of such occurrences for each dict
size is what's printed by

print_collision_counts(py3_6_1_lookdict_perturb)

in either of the attached scripts.  Given that it's 91 for dicts of size 256,
for instance, this seems rather ripe for improvement.

--


[issue30671] dict: improve lookup function

2017-06-16 Thread Dmitry Rubanovich

Dmitry Rubanovich added the comment:

Changing the title to remove "simplify", since the discussion revealed that
the potential improvement would have to be treated as a special case and,
therefore, would not simplify the code.

--
title: dict: simplify and improve lookup function -> dict: improve lookup function


[issue30671] dict: improve lookup function

2017-06-18 Thread Dmitry Rubanovich

Dmitry Rubanovich added the comment:

A few notes on my motivation and other comments I made (not necessarily in
order of significance):

1. I started looking at this because of the fix in Issue28201.  Its effect
shows up as the difference between

print_collision_counts(previous_lookdict_perturb)

and

print_collision_counts(py3_6_1_lookdict_perturb)

because the before-28201 code was, in effect, multiplying by 6 instead of 5
and, therefore, always had a 1/2 chance of secondary hashes colliding.

2. What I demonstrated was that the fix, as it stands, doesn't do as much
magic as we would hope, because it produces a scheme in which **first**
**secondary** hashes collide at a rate as high as 1/3.

3. I somewhat regret extending the script to demonstrate what happens to the
2nd and 3rd secondary hashes, because it created the perception that I was
trying to demonstrate a simulation.  I wasn't.  I was trying to demonstrate a
calculation.  Well, calculations for each dictionary size up to 2**20.

4. I detect apathy towards the fix, but I get the sense that it's mostly due
to the fact that I initially didn't realize I should have squarely limited
the issue to the 1st secondary hash.  And after I made the initial error in
the calculation (as pointed out by Inada Naoki), which would have produced an
unstable calculation of the 2nd, 3rd, 4th, etc. secondary hashes (as pointed
out by Tim), there is a lot of pushback.

But the fix I propose DOES improve the ***1st secondary hash***, because it
eliminates any possibility of collision: no two 1st secondary hashes will be
the same for any two different primary hashes.

And in the latest iteration of this proposal I did limit the scope of the
enhancement to the 1st secondary hash only.  As it happens, this collision
does not occur for dictionaries of sizes 8, 16, and 32; it only becomes
possible for dictionaries of size 64 or greater.  The latest script
demonstrates this.
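
A condensed way to check that (my own sketch of what the script computes):
in the model where primary hashes range over the ring, the 3.6.1 first probe
is collision-free at sizes 8, 16, and 32 and first collides at 64; the
proposed probe is congruent to 3*h + 1 mod 2**k, and 3 is odd (a unit of the
ring), so it is a bijection and never collides.

PERTURB_SHIFT = 5
for k in range(3, 9):  # table sizes 8 through 256
    mask = 2**k - 1
    cur = {(5 * (h & mask) + (h >> PERTURB_SHIFT) + 1) & mask
           for h in range(2**k)}
    new = {(2 * (h & mask) + h + 1) & mask for h in range(2**k)}
    print(2**k, 2**k - len(cur), 2**k - len(new))  # lost slots: 3.6.1, proposed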

5. Sorry, Tim, but I should have restricted my statements to the calculation
I was certain about.  The further comments about what happens to later
secondary hashes were speculation on my part, and this calculation does not
reveal much about them.  If you want to consider what happens with later
secondary indexes (as you do in hashsim.py), that's a different problem.  To
reiterate, my intention was to further improve on the fix in 28201.


On the size of perturb:

6. Since hashes of strings and other objects are pointers, and can safely be
assumed to be pointers addressing the heap, they are aligned on a boundary of
2 * sizeof(void*).  That's 16 in 64-bit builds, so the last 4 bits are 0.
And (depending on how much heap is used) the higher bits of the pointers are
also similar in applications which don't use a lot of memory.

For example, in an application in which 16MiB of heap is used, only bits
35 (= 63-4-24) through 59 (= 63-4) are truly random.  The rest of the bits
can be assumed not to change, but they are not reached until the 5th
iteration of the loop, so this already represents a mildly degenerate case
for this use case (in which most dictionary keys are strings).

I haven't given much thought to which bits will have the most entropy in 
general, but that's going too far off topic.

--
type: enhancement -> performance
Added file: http://bugs.python.org/file46961/1st_secondary_collision_counts.py
