https://www.postgresql.org/message-id/camp0ubeo3fzzefie1vmc1ajkkrpxlnzqooaseu6cccco-c+...@mail.gmail.com
In that thread, I pointed out some important considerations for the hash functions themselves. This is a follow-up, after I looked more carefully. 1. The hash functions as they exist today aren't portable -- they can return different results on different machines. That means using these functions for hash partitioning would yield different contents for the same partition on different architectures (and that's bad, considering they are logical partitions and not some internal detail). The core hashing algorithm is based on 32-bit chunks, so it's only portable if the input is an int32 (or array of int32s). That's not true for varchar (etc.) or numeric (which is based on an array of int16s). There is a hack for int8, but see issue #2 below. We could try to mark portable vs. unportable hash functions, but it seems quite valuable to be able to logically partition on varchar, so I think we should have some portable answer there. Another annoyance is that we would have to assume container types are unportable, or make them inherit the portability of the underlying type's hash function. We could revert 26043592 (copied Tom because that was his patch) to make hash_any() go back to being portable -- do we know what that speedup actually was? Maybe the benefit is smaller on newer processors? Another option is to try to do some combination of byteswapping and word-at-a-time, which might be better than byte-at-a-time if the byteswapping is done with a native instruction. 2. hashint8() is a little dubious. If the input is positive, it XORs the high bits; if negative, it XORs the complement of the high bits. That works for compatibility with hashint2/4, but it does not cause good hash properties[1]. I prefer that we either (a) upcast int2/4 to int8 before hashing and then hash the whole 64 bits; or (b) if the value is within range, downcast to int4, otherwise hash the whole 64 bits. 3. We should downcast numeric to int4/8 before hashing if it's in range, so that it can be a part of the same opfamily as int2/4/8. 4. text/varchar should use hashbpchar() so that they can be part of the same opfamily. Trailing blanks seem unlikely to be significant for most real-world data anyway. 5. For salts[2], I don't think it's too hard to support them in an optional way. We just allow the function to be a two-argument function with a default. Places that care about specifying the salt may do so if the function has pronargs==2, otherwise it just gets the default value. If we have salts, I don't think having 64-bit hashes is very important. If we run out of bits, we can just salt the hash function differently and get more hash bits. This is not urgent and I believe we should just implement salts when and if some algorithm needs them. Regards, Jeff Davis [1] You can a kind of mirroring in the hash outputs indicating bad mixing: postgres=# select hashint8((2^32-100)::int8); hashint8 ---------- 41320869 (1 row) postgres=# select hashint8((-2^32-100)::int8); hashint8 ---------- 41320869 (1 row) postgres=# select hashint8((2^32-101)::int8); hashint8 ------------- -1214482326 (1 row) postgres=# select hashint8((-2^32-101)::int8); hashint8 ------------- -1214482326 (1 row) [2] Not sure I'm using the term "salt" properly. I really just mean a way to affect the initial state, which I think is good enough for our purposes. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers