Eric Blake wrote:
> Jim Meyering <jim <at> meyering.net> writes:
>> Eric Blake wrote:
>> > +  /* Guarantee that once we start moving entries into the new table,
>> > +     we will not need any further memory allocations.  We only need
>> > +     new allocations if the hasher consolidates multiple buckets from
>> > +     the old size into one bucket with the new size (a sign of a bad
>> > +     hasher, but we must deal with it).  Therefore, we need merely
>> > +     check that there are at least table->n_buckets_used-1 entries on
>> > +     the free entry list for the worst case collision (everything gets
>> > +     combined into one bucket during the rehash).  */
>>
>> That strikes me as pessimistic.
>> Imagine that we have a full table, with n_buckets_used == 20
>> and n_buckets == 23.
>>
>> When rehashing to say 31, odds are very good that we can
>> get by with *no* entries on the free list -- if we're careful.
>> So requiring 22 is a recipe for unnecessary failure.
>
> You are correct that in the common case, we probably won't need any extra
> free entries.  And I can prove that in the worst case, we need 19 free
> entries (not 22), when all 20 buckets of the old table hash to the same
> bucket of the new

Yep.  n_buckets_used - 1 is what I meant, not 22.

> (and this would be easy to reproducibly test, by writing a hasher that is
> well-behaved for the initial bucket size but returns a constant for the
> new size).  So the question is now whether we want to be pessimistic and
> cause unnecessary memory pressure, or optimistic but have more work to undo
> if our optimism was wrong.  But it is usually the right approach to cater
> to the common case, so I'm starting to lean towards the optimistic approach
> with complicated undo if we guessed wrong.
>
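[Editorial aside, not part of the exchange: a minimal sketch of the kind of
deliberately bad hasher Eric describes, assuming gnulib's Hash_hasher
signature (size_t hasher (const void *entry, size_t table_size)).  The
function name, the INITIAL_CANDIDATE constant, and the size threshold are
made up for illustration; a real test would compare against
hash_get_n_buckets (), since hash_initialize may round the requested size
up to a larger prime.]

  #include <stddef.h>
  #include <stdint.h>

  enum { INITIAL_CANDIDATE = 23 };       /* hypothetical initial size */

  /* Well-behaved while the table is small, but collapses everything into
     bucket 0 once the table has grown, forcing the worst-case collision
     discussed above (every old bucket lands in one new bucket).  */
  static size_t
  colliding_hasher (const void *entry, size_t table_size)
  {
    if (table_size <= INITIAL_CANDIDATE)
      return (uintptr_t) entry % table_size;   /* reasonable spread */
    return 0;                                  /* constant after growth */
  }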
> My first patch attempt[1] was to provide cleanup on allocation failure; it
> was only my second version that switched over to the pessimistic version of
> guaranteeing no possibility of allocation failure to avoid the need for
> cleanup in the first place.  The algorithm you describe below rather
> closely resembles my first version (unfortunately, my first attempt is not
> posted in any public repo at the moment).
>
> [1] http://article.gmane.org/gmane.comp.lib.gnulib.bugs/17849
>
>>
>> Right after your initial report, I began rewriting hash_rehash
>> to work as follows:
>>
>>   allocate new table
>>
>>   [ optimization to minimize possibility of unnecessary failure:
>>     in the old table,
>>     liberate any overflow (chained) entries by moving "data"
>>     into empty buckets ]
>
> I hadn't thought of that.  It frees (n_buckets - n_buckets_used) entries up
> front, leaving less pressure for the rest of the algorithm.  But it is
> worthless for my pessimistic algorithm (why? because although you have more
> free entries, you have also increased n_buckets_used by the same amount, so
> you have ended up only wasting time in moving data).  Even for the
> optimistic approach, it means that a failed allocation requires one more
> pass through the original table to undo this effect.  And in the optimal
> case, it makes no difference (with a perfect hasher, there are no overflow
> entries in the first place).  At this point, I'm inclined to say that it
> probably won't help, unless we can come up with a contrived hasher where we
> really do see higher memory pressure mid-move than with either the pre- or
> post-resize table, such that reducing pressure up front would be worthwhile.

I agree.  Just an idea, and probably not worthwhile.

>>   iterate through old table members, hashing them into the new table,
>>
>>   [Possible optimization: process all overflow-entries before
>>    any single-entry bucket.  But that means two passes.
>>    Alternatively: one pass through buckets, but for chains of
>>    length > 1, process all allocated entries before the first,
>>    non-allocated one, which (hence) might require an allocation
>>    in the new table. ]
>
> My first attempt used a full two-pass algorithm for the cleanup case.  With
> one pass, the allocations and frees are intermingled; with two passes, all
> frees occur before any allocations.  However, I have been unable (so far)
> to contrive any such hasher that would show a difference in worst-case
> pressure with the intermingled operation.  So, for better cache locality,
> I'd rather avoid two passes in the common case.  But we definitely want to
> code things to move the overflow first and bucket head last within each
> bucket, as that is a trivial rewrite with obvious benefits.

I agree.  Two-pass sounds expensive.
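[Editorial aside, not part of the exchange: a sketch of the "overflow first,
bucket head last" move order being agreed on here.  The struct hash_entry
layout mirrors gnulib's hash.c (a data pointer plus an overflow link), but
move_bucket, transfer_entry, and recycle_cell are hypothetical names, not
the module's real internals.]

  #include <stdbool.h>

  struct hash_entry { void *data; struct hash_entry *next; };
  struct hash_table;            /* opaque for this sketch */

  /* Hypothetical helpers: transfer_entry may need to allocate a cell in
     the destination table; recycle_cell returns a cell to the free list.  */
  extern bool transfer_entry (struct hash_table *dst, void *data);
  extern void recycle_cell (struct hash_table *src, struct hash_entry *cell);

  /* Move one old bucket: chained (overflow) cells first, each recycled as
     soon as its data has moved, and the bucket head last.  The head cell
     lives inside the old bucket array and is never freed, so moving it
     last means every overflow cell has already been returned to the free
     list before the head's move could possibly need a fresh cell.  */
  static bool
  move_bucket (struct hash_table *old_tab, struct hash_table *new_tab,
               struct hash_entry *head)
  {
    struct hash_entry *cursor = head->next;
    while (cursor)
      {
        struct hash_entry *next = cursor->next;
        if (!transfer_entry (new_tab, cursor->data))
          return false;         /* caller undoes the partial move */
        recycle_cell (old_tab, cursor);
        cursor = next;
      }
    return transfer_entry (new_tab, head->data);
  }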
>> If, in spite of all that, allocation is required, restore to
>> the original table any entries that we've moved into the new one
>> and free the new one and return false.  However, restoring them
>> may be tricky, so we'd have to be careful there, too.
>
> I think I've convinced myself that recovery is always possible without
> further allocation, although I'm still unsure whether there is ever a case
> where the cleanup would require a two-pass algorithm because the one-pass
> approach has higher pressure.
>
>>
>> This approach has the added advantage of decreasing the memory
>> footprint slightly, since we're more likely to reuse existing
>> allocated entries than to allocate new ones that would eventually
>> end up on the free list.
>
> It decreases the memory footprint of the table, but does increase the code
> size.  But that's the price of avoiding worst-case pessimism (and besides,
> a code size increase of even a few k is probably much smaller than the cost
> of pessimistically overallocating, where real-world resizing will probably
> be in the millions of hash table entries before memory pressure is an
> issue).
>
> Do you want me to resurrect the bits of my first patch, and submit
> something that recovers from failure rather than preventing it?

Sure!  Thanks.
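[Editorial aside, not part of the exchange: what the "recover rather than
prevent" approach would mean for a caller, sketched against gnulib's public
hash.h interface (hash_rehash returning bool, hash_get_n_buckets).  The
grow_if_possible wrapper is hypothetical, and the guarantee that a false
return leaves the table untouched is exactly what the patch under
discussion is meant to provide, not a statement about any released
version.]

  #include <stdio.h>
  #include "hash.h"             /* gnulib hash module */

  /* Try to grow TABLE; on allocation failure just keep using the old,
     smaller (and therefore slower, but still valid) table.  */
  static void
  grow_if_possible (Hash_table *table)
  {
    if (!hash_rehash (table, 2 * hash_get_n_buckets (table)))
      fprintf (stderr, "hash_rehash failed; keeping %zu buckets\n",
               hash_get_n_buckets (table));
  }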