Jim Meyering <jim <at> meyering.net> writes:

> > Additionally, it looks like hash_rehash has a memory leak - if
> > new_table is allocated, but we later fail to allocate new_entry, the
> > function returns immediately without reclaiming new_table.  Which
> > means that even if an application is prepared to deal with a false
> > return by trying other means to reduce memory usage rather than just
> > giving up with xalloc_die, the leaked memory from the previous
> > attempt will interfere.
>
> Good catch.
> That is definitely a bug.
I found another bug in hash.c.  hash_rehash can get the table into a
state where it will NEVER rehash again, no matter how many more times
you call hash_insert.  I triggered this bug in m4 by hashing the
strings "m1", "m2", ... "m100000"; m4 uses the default hash_tuning and
an initial capacity request of 511.  It also uses the following hash
function (the hash covers the string plus its length, so that strings
with embedded NUL bytes can be hashed):

static size_t
symtab_hasher (const void *entry, size_t buckets)
{
  const symbol *sym = (const symbol *) entry;
  return hash (SYMBOL_NAME (sym), SYMBOL_NAME_LEN (sym)) % buckets;
}

static size_t
hash (const char *s, size_t len)
{
  size_t val = len;

  /* This algorithm was originally borrowed from GNU Emacs, but has
     been modified to allow embedded NUL.  */
  while (len--)
    val = ((val << 7) + (val >> (sizeof val * CHAR_BIT - 7))
           + to_uchar (*s++));
  return val;
}

Note what happens in the call to hash_rehash, just after everything
has been copied to the new table, but before the old table is freed:

(gdb) p *table
$1 = {bucket = 0x716318, bucket_limit = 0x717720, n_buckets = 641,
  n_buckets_used = 513, n_entries = 3289, tuning = 0x443cd8,
  hasher = 0x417ab5 <symtab_hasher>,
  comparator = 0x417af1 <symtab_comparator>,
  data_freer = 0x417b45 <symtab_free_entry>, free_entry_list = 0x0}
(gdb) p *new_table
$2 = {bucket = 0x798970, bucket_limit = 0x79a5c8, n_buckets = 907,
  n_buckets_used = 907, n_entries = 0, tuning = 0x443cd8,
  hasher = 0x417ab5 <symtab_hasher>,
  comparator = 0x417af1 <symtab_comparator>,
  data_freer = 0x417b45 <symtab_free_entry>, free_entry_list = 0x75b100}

Before the rehash, about 20% of the buckets were empty.  But with the
new bucket count (a different prime, 907 instead of 641), the hasher
scatters better, and _every single bucket_ is occupied.  That means
ALL subsequent calls to hash_insert will chain a new overflow entry
rather than fill an empty bucket head.  And since the resize check in
hash_insert (the code that decides whether to call hash_rehash) only
runs after an insertion into an empty bucket head, the table has
effectively been locked into a fixed size: it will never grow again,
no matter how many hash_insert calls are made, even though it has long
since blown past the requested threshold of keeping at least 20% of
buckets empty.  Ouch.

In other words, this comment in hash_insert:

  /* If the growth threshold of the buckets in use has been reached,
     increase the table size and rehash.  There's no point in checking
     the number of entries:  if the hashing function is ill-conditioned,
     rehashing is not likely to improve it.  */

is not quite right.  A hashing function may be somewhat
ill-conditioned for one bucket size, yet much better conditioned for
another size.

At any rate, it looks like the correct fix is to check the rehashing
threshold against the number of entries, not the number of buckets in
use, and to perform that check on every insertion, not just on
insertions that land in an empty bucket head.  Or, at the very least,
check the threshold PRIOR to insertion rather than after, so that in
the m4 case above the very next call to hash_insert would notice the
table was at 100% bucket occupancy and force another resize beyond 907
buckets.  Rough sketches of both the current logic and the change I
have in mind follow.
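To make that concrete, here is the shape of the current hash_insert
logic as I read it, reduced to a compilable sketch.  The struct and
function names below (hash_tuning, hash_entry, hash_table,
insert_tail_sketch) are simplified stand-ins, not the real
declarations from hash.c:

#include <stddef.h>

/* Simplified stand-ins for the internal types in hash.c; the field
   names mirror the gdb dumps above, but this is a sketch, not the
   real code.  */
struct hash_tuning { float growth_threshold; };
struct hash_entry { void *data; struct hash_entry *next; };
struct hash_table
{
  struct hash_entry *bucket;   /* array of n_buckets bucket heads */
  size_t n_buckets;
  size_t n_buckets_used;
  size_t n_entries;
  const struct hash_tuning *tuning;
};

/* Tail of hash_insert: BUCKET is the head chosen by the hasher,
   ENTRY is the element being added.  */
static void
insert_tail_sketch (struct hash_table *table, struct hash_entry *bucket,
                    void *entry)
{
  if (bucket->data == NULL)
    {
      /* Empty bucket head: store ENTRY directly...  */
      bucket->data = entry;
      table->n_entries++;
      table->n_buckets_used++;

      /* ...and ONLY on this path check whether to grow.  Once every
         bucket head is occupied, as in the m4 session above, this
         branch is never reached again, so the table can never
         grow.  */
      if (table->n_buckets_used
          > table->tuning->growth_threshold * table->n_buckets)
        {
          /* hash_rehash (table, <larger candidate size>);  */
        }
    }
  else
    {
      /* Occupied head: a new overflow entry is chained onto the
         bucket and n_entries is bumped, but there is NO growth check
         anywhere on this path.  */
    }
}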
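And here is one way to write the entry-based, check-before-insert
variant, reusing the stand-in types above.  The helper name
table_needs_growth is mine, purely for illustration:

#include <stdbool.h>

/* Grow when one more entry would push the load past the growth
   threshold.  Keyed off n_entries rather than n_buckets_used, so a
   table whose bucket heads are all occupied still grows; called
   BEFORE choosing a bucket, so even an insertion that would chain
   onto an overflow entry can trigger a resize.  */
static bool
table_needs_growth (const struct hash_table *table)
{
  return (table->n_entries + 1
          > table->tuning->growth_threshold * table->n_buckets);
}

hash_insert would call this first and, when it returns true, invoke
hash_rehash with whatever candidate-size computation hash.c already
uses, before hashing the new entry into a bucket.  With the numbers
from the gdb session above and the default growth_threshold of 0.8,
we would have 3290 > 0.8 * 907, so the very next insertion would
force a resize instead of chaining overflow entries forever.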
Thoughts, before I implement something along these lines?

-- 
Eric Blake