On Mon, Jun 12, 2023, at 21:58, Tomas Vondra wrote:
> But those are numbers for large keys - if you restrict the input to
> 4B-16B (which is what we planned to do by focusing on int, bigint and
> uuid), there's no significant difference:
Oh, sorry, I completely failed to read the meaning of the columns.

> lookup3:
>
> Small key speed test - 4-byte keys - 30.17 cycles/hash
> Small key speed test - 8-byte keys - 31.00 cycles/hash
> Small key speed test - 16-byte keys - 49.00 cycles/hash
>
> xxh3low:
>
> Small key speed test - 4-byte keys - 29.00 cycles/hash
> Small key speed test - 8-byte keys - 29.58 cycles/hash
> Small key speed test - 16-byte keys - 37.00 cycles/hash

The winner of the "Small key speed test" competition seems to be ahash64 ("ahash 64bit"):

Small key speed test - 4-byte keys - 24.00 cycles/hash
Small key speed test - 8-byte keys - 24.00 cycles/hash
Small key speed test - 16-byte keys - 26.98 cycles/hash

It looks like a popular one; e.g. it is the default hasher in Rust's hashbrown crate, which backs std::collections::HashSet. Another notable property of ahash64 is that it is DoS resistant, but that isn't crucial for our use case, since we exclusively target auto-generated primary keys which are not influenced by end users.

> Not sure what you mean by "optimizing for space" - I imagined the
> on-disk format would just use the same hash table with tiny amount of
> free space (say 10% and not ~50%).

By "optimizing for space" I meant trying to find some alternative or intermediate data structure that is more likely to be compressible, like your idea of sorting the data.

What I wondered was whether the on-disk format would be affected by the choice of hash function. I guess it wouldn't, if the hashset is created by adding the elements one by one while iterating through the elements read from the on-disk format. But I thought maybe something more advanced could be done, where conversion between the on-disk format and the in-memory format happens without naively iterating through all elements, i.e. something cheaper than O(n). No idea what that mechanism would be, though.
> My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
> (through hash_bytes or something), and at the same time make it possible
> to switch to a different function in the future. I'd store an ID of the
> hash function in the set, so that we can support a different algorithm
> in the future, if we choose to.

Sounds good to me. Smart idea to include the hash function algorithm ID in the set, to allow switching to a different one in the future!

/Joel