[issue30671] dict: simplify and improve lookup function
New submission from Dmitry Rubanovich:

lookdict_index() (and the rest of the lookup functions in dictobject.c) use an unnecessarily complicated perturb mechanism, and in fact it is slower than the simpler alternative. Instead of this:

    for (size_t perturb = hash;;) {
        perturb >>= PERTURB_SHIFT;
        i = mask & ((i << 2) + i + perturb + 1);

it should do this:

    for (size_t perturb = hash;;) {
        i = mask & ((i << 1) + perturb + 1);
        perturb >>= PERTURB_SHIFT;

This would not only save an instruction (a minor issue), it would also reduce collisions.

I've attached a file which calculates collision frequencies for demonstration purposes. It shows that the calculation, as it stands right now, does not create a 1-1 mapping even on the 1st iteration through the loop. Moving PERTURB_SHIFT to the line before the calculation does reduce the density of the collision space, but the calculation I propose eliminates collisions on the 1st iteration completely and reduces them on most subsequent iterations.

----------
components: Interpreter Core
messages: 296072
nosy: Dmitry Rubanovich, inada.naoki, rhettinger, serhiy.storchaka, xiang.zhang
priority: normal
severity: normal
status: open
title: dict: simplify and improve lookup function
type: enhancement
versions: Python 3.7
Added file: http://bugs.python.org/file46953/collisions_count.py
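The attached script is not reproduced here, but the kind of count it reports can be sketched roughly as follows (an approximation, not the attached collisions_count.py; it assumes for simplicity that each hash equals its primary index, i.e. hashes below the table size):

    # Rough sketch: count first-probe collisions for the current (3.6.1)
    # probe recurrence vs. the proposed one, assuming hashes below the
    # table size so the primary index equals the hash.
    from collections import Counter

    PERTURB_SHIFT = 5

    def first_probe_3_6_1(h, mask):
        # current code: perturb is shifted before it is mixed in
        perturb = h >> PERTURB_SHIFT
        return (5 * (h & mask) + perturb + 1) & mask

    def first_probe_proposed(h, mask):
        # proposed code: the unshifted perturb is used on the first probe
        perturb = h
        return (2 * (h & mask) + perturb + 1) & mask

    def collisions(first_probe, k):
        mask = 2**k - 1
        counts = Counter(first_probe(h, mask) for h in range(2**k))
        return sum(c - 1 for c in counts.values() if c > 1)

    for k in range(8, 13):
        # second column is nonzero, third column is 0 (the proposed first
        # probe reduces to (3*h + 1) mod 2**k, which is a bijection)
        print(2**k, collisions(first_probe_3_6_1, k),
              collisions(first_probe_proposed, k))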
[issue30671] dict: simplify and improve lookup function
Dmitry Rubanovich added the comment:

I am looking at the 3.6.1 code. I am not trying to simulate the index lookup as the lookup function would do it; there is no point in that. The point of the script is to detect collisions among perturbed secondary indexes, rather than between primary hashes and secondary ones. To that end, it applies the perturbed secondary hash function to **all** primary indexes and stores the results. This makes it possible to judge the quality of the secondary hash function.

The "old" one was bad because it automatically created collisions by multiplying by zero-divisors of the 2^k ring. The one used in 3.6.1 is better because it doesn't always do that, but it still multiplies by a zero-divisor of the ring when h mod 2^k is equal to (h >> 5) mod 2^k. This multiplication by zero-divisors of the 2^k ring is what produces collisions; the script simply counts them.

The later iterations of the loop are not very relevant. They will almost certainly produce further collisions, but that's unavoidable.

By the way, just as a test, I changed the value of PERTURB_SHIFT to 1 and, with my proposed solution, it produced lower collision counts after 3 runs of the loop in virtually all rings (i.e., mod all values of 2^k for k from 8 to 20). Intuitively this makes sense because there is less information loss, but I don't have a formal proof of why that should be.
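A minimal sketch of the zero-divisor point about the "old" (pre-28201) scheme, under the same assumption that hashes are smaller than the table size (so the primary index and the unshifted perturb both equal h):

    # Pre-28201, the first probe used the unshifted perturb, so it reduces
    # to (5*h + h + 1) mod 2**k = (6*h + 1) mod 2**k.  Because 6 is a
    # zero-divisor mod 2**k, only half of the 2**k slots are reachable on
    # the first probe, i.e. a 1/2 collision rate.
    k = 10
    mask = 2**k - 1
    first_probes = {(6 * h + 1) & mask for h in range(2**k)}
    print(len(first_probes), 2**k)   # prints "512 1024": half the slots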
[issue30671] dict: simplify and improve lookup function
Dmitry Rubanovich added the comment:

Yes, I do understand it. But the statement in lines 166-167 of dictobject.c, "For any initial j in range(2**i), repeating that 2**i times generates each int in range(2**i) exactly once", does not hold once the perturb is added. 2**i iterations will not be enough to generate all elements of the ring when some of the multipliers are zero-divisors.

Specifically, if you use j = ((5*j) + P + 1) mod 2**i, you are effectively multiplying by a zero-divisor every time P = j mod 2. And I don't mean "mod 2**i", I do mean "mod 2". That is, any time P (which changes a few times and eventually becomes 0) has the same parity as j, you are multiplying by a zero-divisor. Because P is eventually 0, you will go through all the values **eventually**. But for all values of P for which there is a P' such that P != P' and ((5*j) + P + 1) = ((5*j) + P' + 1) mod 2**i, the number of times you'll need to apply the function will be higher than 2**i. And that's the worst-case scenario. In the best case, the collision probability can be read off the values printed by print_collision_counts(py3_6_1_lookdict_perturb) in my script: it can be as low as ~1/20 and as high as ~1/3, depending on the value of i.

The main speed-up we are seeking is to avoid a collision early on; we are less concerned about collisions later. Anecdotally, if your dict has 256 buckets, the chance of a collision is about 1 in 3.5. That is an improvement over 1 in 2, but still pretty high.

OK, how about this: to avoid the edge cases, unroll the 1st secondary hash to use j = 2*j + P + 1, i.e. test it once before the loop, but leave 5*j + P + 1 in the loop as is.

Although, to be honest, if PERTURB_SHIFT were changed to 1 and we had a dict with a key that causes 64 collisions, that might be a good time to resize even if the table is still sparse. On the other hand, this might create an attack vector in which some magic value causes resizes of nearly empty dicts. So maybe not; certainly not without further analysis.

BTW, and this is mostly stylistic, but except for the "*hashpos = i & mask;" line in find_empty_slot(), the applying of the mask could be moved into dk_get_index(). Again, I am looking at the released 3.6.1 code, so if this is already done, never mind.

As another note, changing PERTURB_SHIFT to 1 causes a near-universal reduction in collisions (for all strategies), so if nothing else, that's probably an improvement.

----------
Added file: http://bugs.python.org/file46956/fewer_collisions_count.py
[issue30671] dict: simplify and improve lookup function
Dmitry Rubanovich added the comment:

Tim, I am not testing randomly. The scripts calculate the secondary hash for **each** value in a ring to see how often this results in duplicate (or triplicate, etc.) values, and they do it for each (mod 2**i) ring for i between 8 and 20. Then (mostly as an afterthought) the scripts calculate how often each scheme results in a collision if more than one secondary hash index is calculated. I used 3 secondary indexes as a demonstration of the "afterthought" part.

I do understand that the logic relies on being able to reach each value in 0 .. 2**i - 1 in the worst case. Isn't that accomplished by my latest proposal, though? It was this: "...unroll the 1st secondary hash key to use j = 2*j + P + 1. So try to test for it before the loop. But leave 5*j + P + 1 in the loop as is." In the code that would mean changing the loop in, for example, lookdict_index() from

    for (size_t perturb = hash;;) {
        perturb >>= PERTURB_SHIFT;
        i = mask & ((i << 2) + i + perturb + 1);
        ix = dk_get_index(k, i);
        if (ix == index) {
            return i;
        }
        if (ix == DKIX_EMPTY) {
            return DKIX_EMPTY;
        }
    }

to this:

    size_t perturb;
    perturb = hash;
    i = mask & ((i << 1) + perturb + 1);    /* <-- key line */
    ix = dk_get_index(k, i);
    if (ix == index) {
        return i;
    }
    if (ix == DKIX_EMPTY) {
        return DKIX_EMPTY;
    }
    for (;;) {
        /* nothing changes in this loop */
        perturb >>= PERTURB_SHIFT;
        i = mask & ((i << 2) + i + perturb + 1);
        ix = dk_get_index(k, i);
        if (ix == index) {
            return i;
        }
        if (ix == DKIX_EMPTY) {
            return DKIX_EMPTY;
        }
    }

And, of course, it would mean adding the same precheck in front of every loop that goes through this secondary index calculation.

This prevents preventable collisions for hashes h such that h mod 2**k is equal to (h >> 5) mod 2**k, where 2**k is the dict size. The frequency of such occurrences for each dict size is what's printed by print_collision_counts(py3_6_1_lookdict_perturb) in either of the attached scripts. Given that, for instance, it is 91 for dicts of size 256, this seems rather ripe for improvement.
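A quick way to see the effect of the key line (a sketch, not taken from the attached scripts, again assuming hashes below the table size): the unrolled first probe reduces to (3*h + 1) mod 2**k, and 3 is odd, hence a unit mod 2**k, so the map is a bijection and no two primary slots share a first secondary index:

    # Sketch: the unrolled first probe i = mask & ((i << 1) + perturb + 1)
    # with perturb == hash and i == hash & mask reduces to (3*h + 1) mod 2**k
    # for h below the table size, which is a bijection of the ring.
    for k in range(3, 13):
        mask = 2**k - 1
        probes = {(2 * (h & mask) + h + 1) & mask for h in range(2**k)}
        assert len(probes) == 2**k   # no first-probe collisions at any size
    print("first probe is a bijection for every size tested")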
[issue30671] dict: improve lookup function
Dmitry Rubanovich added the comment:

Changing the title to remove "simplify", since the discussion revealed that the potential improvement would have to be treated as a special case and, therefore, would not simplify the code.

----------
title: dict: simplify and improve lookup function -> dict: improve lookup function
[issue30671] dict: improve lookup function
Dmitry Rubanovich added the comment:

A few notes on my motivation and on other comments I made (not necessarily in order of significance):

1. I started looking at this because of the fix in Issue28201. It effectively improves on the difference between print_collision_counts(previous_lookdict_perturb) and print_collision_counts(py3_6_1_lookdict_perturb), because the before-28201 code was, in effect, multiplying by 6 instead of 5 and, therefore, always had a 1/2 chance of secondary hashes colliding.

2. What I demonstrated was that the fix, as it stands, doesn't do as much magic as we would hope, because it produced a scheme in which **first** **secondary** hashes collide at a rate as high as 1/3.

3. I somewhat regret extending the script to demonstrate what happens to the 2nd and 3rd secondary hashes, because it created the perception that I was trying to demonstrate a simulation. I wasn't. I was trying to demonstrate a calculation; well, calculations for each dictionary size up to 2**20.

4. I detect apathy towards the fix, but I get the sense that it's mostly because I initially didn't realize that I should have limited the issue squarely to the 1st secondary hash. After I made the initial error in the calculation (as pointed out by Inada Naoki), which would have produced an unstable calculation of the 2nd, 3rd, 4th, etc. secondary hashes (as pointed out by Tim), there has been a lot of push-back. But the fix I propose DOES improve the ***1st secondary hash***, because it eliminates any possibility of collision: no two 1st secondary hashes will be the same for any two different primary hashes. And in the latest iteration of this proposal I did limit the scope of the enhancement to only the 1st secondary hash. As it happens, this collision does not occur for dictionaries of sizes 8, 16, and 32; it only becomes possible for dictionaries of size 64 or greater. The latest script demonstrates this.

5. Sorry, Tim, but I should have restricted my statements to only the calculation I was certain about. The further comments about what happens to later secondary hashes were speculation on my part, and this calculation does not reveal much about them. If you want to consider what happens with later secondary indexes (as you do in hashsim.py), that's a different problem. To reiterate, my intention was to further improve on the fix in 28201.

On the size of perturb:

6. Since hashes of strings and other objects are pointers, and can safely be assumed to be pointers addressing the heap, they are aligned on a 2 * sizeof(void*) boundary. That's 16 in 64-bit builds, so the last 4 bits are 0. And (depending on how much heap is used) the higher bits of the pointers are also similar in applications which don't use a lot of memory. For example, in an application in which 16 MiB of heap is used, only bits 35 (= 63 - 4 - 24) through 59 (= 63 - 4) are truly random. The rest of the bits can be assumed not to change, but they are not reached until the 5th iteration of the loop, so this already represents a mildly degenerate case in the common situation where most dictionary keys are strings. I haven't given much thought to which bits will have the most entropy in general, but that's going too far off topic.

----------
type: enhancement -> performance
Added file: http://bugs.python.org/file46961/1st_secondary_collision_counts.py
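The size threshold mentioned in point 4 can be checked with a small sketch (same assumption as before, hashes below the table size): under the 3.6.1 scheme the perturb term h >> 5 is zero for tables of 32 slots or fewer, so the first probe stays injective there, and first-probe collisions only appear from 64 slots up:

    # Sketch of point 4's threshold: the 3.6.1 first probe is
    # (5*h + (h >> 5) + 1) mod 2**k.  For 2**k <= 32 the (h >> 5) term is 0
    # and the probe is injective; at 2**k >= 64 collisions appear.
    for k in range(3, 8):                      # table sizes 8 .. 128
        mask = 2**k - 1
        probes = [(5 * h + (h >> 5) + 1) & mask for h in range(2**k)]
        print(2**k, 2**k - len(set(probes)))   # 8, 16, 32 -> 0; 64, 128 -> > 0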