On Sat, Oct 27, 2018 at 12:30 AM Antonin Houska <a...@cybertec.at> wrote:
> 1. The return type of resize() function is void, so I propose part of the
> header comment to be removed:
>
> diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
> index b46f7c4cfd..b2b8fe60e1 100644
> --- a/src/backend/lib/dshash.c
> +++ b/src/backend/lib/dshash.c
> @@ -672,9 +672,7 @@ delete_item(dshash_table *hash_table, dshash_table_item *item)
>
>  /*
>   * Grow the hash table if necessary to the requested number of buckets. The
> - * requested size must be double some previously observed size. Returns true
> - * if the table was successfully expanded or found to be big enough already
> - * (because another backend expanded it).
> + * requested size must be double some previously observed size.
>   *
>   * Must be called without any partition lock held.
>   */
Thanks, will fix.

> 2. Can anyone please explain this macro?
>
> /* Max entries before we need to grow. Half + quarter = 75% load factor. */
> #define MAX_COUNT_PER_PARTITION(hash_table) \
>     (BUCKETS_PER_PARTITION(hash_table->size_log2) / 2 + \
>      BUCKETS_PER_PARTITION(hash_table->size_log2) / 4)
>
> I'm failing to understand why the maximum number of hash table entries in a
> partition should be smaller than the number of buckets in that partition.
>
> The fact that MAX_COUNT_PER_PARTITION refers to entries and not buckets is
> obvious from this condition in dshash_find_or_insert()
>
> /* Check if we are getting too full. */
> if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))

Are you saying there is a bug in this logic (which is nbuckets * 0.75
expressed as integer maths), or saying that 0.75 is not a good maximum
load factor?

I looked around at a couple of general purpose hash tables and saw that
some used 0.75 and some used 1.0, as a wasted-space-vs-collision
trade-off.  If I have my maths right[1], with 0.75 you expect to have 75
entries in ~53 buckets, but with 1.0 you expect to have 100 entries in
~64 buckets.  It'd be a fair criticism that that's arbitrarily different
from the choice made for hash joins, and dynahash's default is 1 (though
it's a run-time parameter).

[1] https://math.stackexchange.com/questions/470662/average-number-of-bins-occupied-when-throwing-n-balls-into-n-bins

--
Thomas Munro
http://www.enterprisedb.com
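PS: For anyone who wants to check the arithmetic, here is a small
standalone sketch (not part of the patch; the helper names and the
100-bucket example are invented for illustration) of the "half + quarter"
integer maths used by MAX_COUNT_PER_PARTITION and of the balls-into-bins
estimate behind [1]:

#include <math.h>
#include <stdio.h>

/*
 * Same integer maths as the macro, for an arbitrary bucket count: grow
 * once the entry count exceeds nbuckets/2 + nbuckets/4, i.e. a 75% load
 * factor (exact whenever nbuckets is a multiple of four).
 */
static int
max_count(int nbuckets)
{
	return nbuckets / 2 + nbuckets / 4;
}

/*
 * Expected number of distinct buckets hit when nentries hash uniformly
 * into nbuckets: nbuckets * (1 - (1 - 1/nbuckets)^nentries), per [1].
 */
static double
expected_occupied(double nentries, double nbuckets)
{
	return nbuckets * (1.0 - pow(1.0 - 1.0 / nbuckets, nentries));
}

int
main(void)
{
	int		nbuckets = 100;

	/* 75: grow once we go past 75 entries in 100 buckets */
	printf("grow threshold: %d entries\n", max_count(nbuckets));
	/* ~52.9 occupied buckets at 0.75 load */
	printf("0.75 load: %.1f occupied buckets\n", expected_occupied(75, nbuckets));
	/* ~63.4 occupied buckets at 1.0 load */
	printf("1.00 load: %.1f occupied buckets\n", expected_occupied(100, nbuckets));
	return 0;
}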