Tim Peters <t...@python.org> added the comment:
Here's a complete xxHash-based implementation via throwing away C++-isms from https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp In the 64-bit build there are no collisions across my tests except for 11 in the new tuple test. The 32-bit build fares worse: - 3 collisions in the old tuple test. - 51 collisions in the new tuple test. - 173 collisions across product([-3, 3], repeat=20) - 141 collisions across product([0.5, 0.25], repeat=20) - no other collisions The expected number of collisions from tossing 2**20 balls into 2**32 buckets is about 128, with standard deviation 11.3. So 141 is fine, but 173 is highly suspect. 51 for the new tuple test is way out of bounds. But I don't much care. None of these are anywhere near disasters, and - as I've already said - I know of no non-crypto strength hash that doesn't suffer "worse than random" behaviors in some easily-provoked cases. All you have to do is keep trying. Much as with SeaHash, adding t ^= t << 1; at the top helps a whole lot in the "bad cases", cutting the new test's collisions to 7 in the 32-bit build. It also cuts the product([-3, 3], repeat=20) collisions to 109. But boosts the old tuple test's collisions to 12. I wouldn't do it: it adds cycles for no realistic gain. We don't really care whether the hash "is random" - we do care about avoiding catastrophic pile-ups, and there are none with an unmodified xxHash. BTW, making that change also _boosts_ the number of "new test" collisions to 23 in the 64-bit build (a little over its passing limit). #define Py_NHASHBITS (SIZEOF_PY_HASH_T * CHAR_BIT) #if Py_NHASHBITS >= 64 # define Py_HASH_MULT1 (Py_uhash_t)11400714785074694791ULL # define Py_HASH_MULT2 (Py_uhash_t)14029467366897019727ULL # define Py_HASH_LSHIFT 31 #elif Py_NHASHBITS >= 32 # define Py_HASH_MULT1 (Py_uhash_t)2654435761ULL # define Py_HASH_MULT2 (Py_uhash_t)2246822519ULL # define Py_HASH_LSHIFT 13 #else # error "can't make sense of Py_uhash_t size" #endif #define Py_HASH_RSHIFT (Py_NHASHBITS - Py_HASH_LSHIFT) static Py_hash_t tuplehash(PyTupleObject *v) { Py_uhash_t x, t; /* Unsigned for defined overflow behavior. */ Py_hash_t y; Py_ssize_t len = Py_SIZE(v); PyObject **p; x = 0x345678UL; p = v->ob_item; while (--len >= 0) { y = PyObject_Hash(*p++); if (y == -1) return -1; t = (Py_uhash_t)y; x += t * Py_HASH_MULT2; x = (x << Py_HASH_LSHIFT) | (x >> Py_HASH_RSHIFT); x *= Py_HASH_MULT1; } x += 97531UL; if (x == (Py_uhash_t)-1) x = -2; return x; } #undef Py_NHASHBITS #undef Py_HASH_MULT1 #undef Py_HASH_MULT2 #undef Py_HASH_LSHIFT #undef Py_HASH_RSHIFT ---------- _______________________________________ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue34751> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com