On Mon, Aug 1, 2016 at 4:34 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.mu...@enterprisedb.com> writes:
>> 2. I suspect that this algorithm for combining hashes is weak, and
>> could amplify weaknesses in the hash functions feeding it.
>
> Very possibly, but ...
Concrete example: suppose a clever data type author defines a perfect hash
function that maps values to small integers.  In terms of hash collisions,
this performs optimally in a single-column hash join, agg, index etc. by
definition, and seems like an entirely reasonable thing to want to do, but
it performs terribly with Postgres's hash combination algorithm as soon as
you use several columns.  The stuffing is all up one end of the cushion,
which is OK for a perfect hash on its own, but we do very little to spread
it around when combining.  For two columns that hash to 8 bit integers, we
map 16 bits of information to only 9 bits:

postgres=# select count(*) as distinct_outputs,
postgres-#        min(collisions),
postgres-#        max(collisions),
postgres-#        avg(collisions),
postgres-#        stddev(collisions)
postgres-# from (
postgres(#   select s1.i # (s2.i << 1),
postgres(#          count(*) as collisions
postgres(#   from generate_series(0, 255) as s1(i)
postgres(#   cross join generate_series(0, 255) as s2(i)
postgres(#   group by 1
postgres(# ) ss;
┌──────────────────┬─────┬─────┬──────────────────────┬────────┐
│ distinct_outputs │ min │ max │         avg          │ stddev │
├──────────────────┼─────┼─────┼──────────────────────┼────────┤
│              512 │ 128 │ 128 │ 128.0000000000000000 │      0 │
└──────────────────┴─────┴─────┴──────────────────────┴────────┘
(1 row)

The Boost combiner does better.  If I have this right:

postgres=# create or replace function hash_combine(seed bigint, hash bigint)
postgres-# returns bigint as
postgres-# $$
postgres$#   select (seed # (hash + 2654435769 + (seed << 6) + (seed >> 2)))
postgres$#          % 4294967296;
postgres$# $$ language sql;
CREATE FUNCTION
postgres=#
postgres=# select count(*) as distinct_outputs,
postgres-#        min(collisions),
postgres-#        max(collisions),
postgres-#        avg(collisions),
postgres-#        stddev(collisions)
postgres-# from (
postgres(#   select hash_combine(hash_combine(0, s1.i), s2.i),
postgres(#          count(*) as collisions
postgres(#   from generate_series(0, 255) as s1(i)
postgres(#   cross join generate_series(0, 255) as s2(i)
postgres(#   group by 1
postgres(# ) ss;
┌──────────────────┬─────┬─────┬────────────────────┬────────────────────────┐
│ distinct_outputs │ min │ max │        avg         │         stddev         │
├──────────────────┼─────┼─────┼────────────────────┼────────────────────────┤
│            16583 │   1 │   9 │ 3.9519990351564856 │ 0.70952439864334926106 │
└──────────────────┴─────┴─────┴────────────────────┴────────────────────────┘
(1 row)

Not as good as I hoped (maybe there is a bug in my query?), but not a
128-car pile-up.  There are probably other kinds of specialised (or poorly
written) hash functions that will work well or acceptably on their own but
not so well when combined without good mixing.  Like Boost, we are dealing
with hashes produced by arbitrary hash functions that can be supplied by
user code, not just by the high-quality built-in function from Bob Jenkins.

>> Compare Boost's hash_combine, which does some more work before XORing:
>
>> seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
>
> I can't help being reminded of Knuth's story about how he tried to invent
> the world's best random number generator, and was disappointed when
> it almost immediately converged to a short repeating sequence.  If
> there's any actual theoretical basis to the above, I'd be interested
> to see it.  But as-is, the use of addition rather than XOR looks fishy,
> and the way the old seed is shifted around looks more likely to result
> in cancellation than anything else.
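For what it's worth, here is the same comparison as a throwaway standalone
C program rather than SQL (only a sketch, not proposed PostgreSQL code;
combine_naive and combine_boost are names I invented for this test).  It
combines the two 8 bit "perfect hash" columns first with a shift-and-XOR
combiner and then with the Boost-style combiner, and counts distinct
outputs over the same 256 x 256 grid; it should report 512 for the former
and the same ~16.5k figure as the query above for the latter.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* shift-and-XOR, in the spirit of the combination tested in the first query */
static uint32_t
combine_naive(uint32_t seed, uint32_t hash)
{
	return (seed << 1) ^ hash;
}

/* Boost-style combiner; 0x9e3779b9 is the 2654435769 constant used above */
static uint32_t
combine_boost(uint32_t seed, uint32_t hash)
{
	return seed ^ (hash + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

static int
cmp_uint32(const void *a, const void *b)
{
	uint32_t	x = *(const uint32_t *) a;
	uint32_t	y = *(const uint32_t *) b;

	return (x > y) - (x < y);
}

/* combine every (a, b) pair in 0..255 x 0..255 and count distinct results */
static size_t
count_distinct(uint32_t (*combine) (uint32_t, uint32_t))
{
	static uint32_t outputs[256 * 256];
	size_t		n = 0;
	size_t		distinct = 0;

	for (uint32_t a = 0; a < 256; a++)
		for (uint32_t b = 0; b < 256; b++)
			outputs[n++] = combine(combine(0, a), b);

	qsort(outputs, n, sizeof(uint32_t), cmp_uint32);
	for (size_t i = 0; i < n; i++)
		if (i == 0 || outputs[i] != outputs[i - 1])
			distinct++;
	return distinct;
}

int
main(void)
{
	printf("naive: %zu distinct outputs for 65536 inputs\n",
		   count_distinct(combine_naive));
	printf("boost: %zu distinct outputs for 65536 inputs\n",
		   count_distinct(combine_boost));
	return 0;
}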
I provided boost::hash_combine as an example only because it's widely known
and used in C++ circles and familiar to me, but so far I have only taken it
on authority that it works.  Here's what I managed to dig up on it this
morning:

According to N3876[1], a C++ proposal to add std::hash_combine (but not a
specific algorithm) to the standard library, Boost's algorithm comes from a
paper called "Methods for Identifying Versioned and Plagiarised Documents
(2003)"[2], where you can see this recipe for shifting and summing as
equation 1, with some discussion and pointers to other papers.  But I see
in N3876's feedback (see the comment from Jeffrey Yasskin) that Bob
Jenkins' mixer (or at least an aspect of it: not requiring intermediate
state to fit in the final output size) is discussed as a potential
implementation.  We already have that in our source tree... perhaps we
should use it.

>> That constant approximates the golden ratio (as a fraction of the 32
>> bit hash space), and it also appears in our hash_any and hash_uint32
>> functions.
>
> I think it's just somebody's idea of a magic random number.  Your link
>> [3] http://cstheory.stackexchange.com/questions/9639/how-did-knuth-derive-a
> provides some reasons to think it might be a good choice for a very
> specific application, but this is not that application --- in particular,
> it doesn't involve multiplying by that constant.

Yeah.  About its use in his 1997 algorithm, Bob Jenkins says: "The golden
ratio really is an arbitrary value.  Its purpose is to avoid mapping all
zeros to all zeros."

[1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf
[2] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.2680

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers