On Wed, 13 Oct 2021 at 20:16, Peter Geoghegan <p...@bowt.ie> wrote:
>
> On Wed, Oct 13, 2021 at 3:44 AM Simon Riggs
> <simon.ri...@enterprisedb.com> wrote:
> > > IMO it'd be nice to show some numbers to support the claims that storing
> > > the extra hashes and/or 8B hashes is not worth it ...
> >
> > Using an 8-byte hash is possible, but only becomes effective when
> > 4-byte hash collisions get hard to manage. 8-byte hash also makes the
> > index 20% bigger, so it is not a good default.
>
> Are you sure? I know that nbtree index tuples for a single-column int8
> index are exactly the same size as those from a single column int4
> index, due to alignment overhead at the tuple level. So my guess is
> that hash index tuples (which use the same basic IndexTuple
> representation) work in the same way.

The hash index tuples are 20 bytes each. If that were rounded up to
8-byte alignment, then that would be 24 bytes. Using pageinspect, the
max(live_items) on any data page (bucket or overflow) is 407 items, so
they can't be 24 bytes long.

Other stats of interest would be that the current bucket design/page
splitting is very effective at maintaining distribution. On a hash
index for a table with 2 billion rows in it, with integer values from
1 to 2 billion, there are 3670016 bucket pages and 524286 overflow
pages, distributed so that 87.5% of buckets have no overflow pages,
and 12.5% of buckets have only one overflow page; there are no buckets
with >1 overflow page. The most heavily populated overflow page has
209 items.

The CREATE INDEX time is fairly poor at present, but that can be
optimized easily enough. I expect to do that after uniqueness is
added, since it would complicate the code to do that work in a
different order.

-- 
Simon Riggs                http://www.EnterpriseDB.com/
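
The items-per-page argument above can be cross-checked with a little
arithmetic. This is only a rough sketch: it assumes the default 8192-byte
block size, a 24-byte page header and 16 bytes of hash special space
(none of which are stated in the message), and it treats the quoted sizes
as the total space each item consumes on the page, since the message does
not say whether the line pointer is included. The name max_items is
made up for illustration.

```python
# Back-of-envelope check of how many items fit on one hash index page.
# All three constants are assumptions, not figures from the message.
BLOCK_SIZE = 8192    # assumed default PostgreSQL block size
PAGE_HEADER = 24     # assumed page header size
HASH_SPECIAL = 16    # assumed hash special space at the end of the page


def max_items(bytes_per_item: int) -> int:
    """Upper bound on items per page for a given per-item footprint."""
    usable = BLOCK_SIZE - PAGE_HEADER - HASH_SPECIAL
    return usable // bytes_per_item


if __name__ == "__main__":
    print(max_items(20))  # 407, matching the observed max(live_items)
    print(max_items(24))  # 339, well below the observed 407
```

Under those assumptions a 20-byte footprint gives exactly the observed
maximum of 407 items, while a 24-byte footprint would cap a page at 339.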