On Mon, Jun 12, 2023, at 21:58, Tomas Vondra wrote:
> But those are numbers for large keys - if you restrict the input to
> 4B-16B (which is what we planned to do by focusing on int, bigint and
> uuid), there's no significant difference:
Oh, sorry, I completely failed to read the meaning of the columns.

> lookup3:
>
> Small key speed test - 4-byte keys - 30.17 cycles/hash
> Small key speed test - 8-byte keys - 31.00 cycles/hash
> Small key speed test - 16-byte keys - 49.00 cycles/hash
>
> xxh3low:
>
> Small key speed test - 4-byte keys - 29.00 cycles/hash
> Small key speed test - 8-byte keys - 29.58 cycles/hash
> Small key speed test - 16-byte keys - 37.00 cycles/hash

The winner of the "Small key speed test" competition seems to be ahash64 ("ahash 64bit"):

Small key speed test - 4-byte keys - 24.00 cycles/hash
Small key speed test - 8-byte keys - 24.00 cycles/hash
Small key speed test - 16-byte keys - 26.98 cycles/hash

It looks like a popular one; e.g. it is the default hasher in Rust's hashbrown crate, which backs std::collections::HashSet. Another notable property of ahash64 is that it is DoS resistant, but that isn't crucial for our use case, since we exclusively target auto-generated primary keys which are not influenced by end users.

> Not sure what you mean by "optimizing for space" - I imagined the
> on-disk format would just use the same hash table with tiny amount of
> free space (say 10% and not ~50%).

By "optimizing for space" I meant trying to find some alternative or intermediate data structure that is more likely to be compressible, like your idea of sorting the data.

What I wondered was whether the on-disk format would be affected by the choice of hash function. I guess it wouldn't, if the hashset is created by adding the elements one by one while iterating through the elements read from the on-disk format. But I thought maybe something more advanced could be done, where conversion between the on-disk format and the in-memory format happens without naively iterating through all elements, i.e. something cheaper than O(n). No idea what that mechanism would be, though.
> My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
> (through hash_bytes or something), and at the same time make it possible
> to switch to a different function in the future. I'd store an ID of the
> hash function in the set, so that we can support a different algorithm
> in the future, if we choose to.

Sounds good to me. Smart idea to include the hash function algorithm ID in the set, to allow switching to a different one in the future!

/Joel