Eric Blake wrote:
> Jim Meyering <jim <at> meyering.net> writes:
>> Eric Blake wrote:
>> > +  /* Guarantee that once we start moving entries into the new table,
>> > +     we will not need any further memory allocations.  We only need
>> > +     new allocations if the hasher consolidates multiple buckets from
>> > +     the old size into one bucket with the new size (a sign of a bad
>> > +     hasher, but we must deal with it).  Therefore, we need merely
>> > +     check that there are at least table->n_buckets_used-1 entries on
>> > +     the free entry list for the worst case collision (everything gets
>> > +     combined into one bucket during the rehash).  */
>>
>> That strikes me as pessimistic.
>> Imagine that we have a full table, with n_buckets_used == 20
>> and n_buckets == 23.
>>
>> When rehashing to say 31, odds are very good that we can
>> get by with *no* entries on the free list -- if we're careful.
>> So requiring 22 is a recipe for unnecessary failure.
>
> You are correct that in the common case, we probably won't need any extra
> free entries.  And I can prove that in the worst case, we need 19 free
> entries (not 22), when all 20 buckets of the old table hash to the same
> bucket of the new

Yep.  n_buckets_used - 1 is what I meant, not 22.

> (and this would be easy to reproducibly test, by writing a hasher that is
> well-behaved for the initial bucket size but returns a constant for the
> new size).  So the question is now whether we want to be pessimistic and
> cause unnecessary memory pressure, or optimistic but have more work to undo
> if our optimism was wrong.  But it is usually the right approach to cater
> to the common case, so I'm starting to lean towards the optimistic approach
> with complicated undo if we guessed wrong.
>
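[Editorial aside, not part of the exchange: a minimal sketch of the kind of
deliberately bad hasher Eric describes, assuming gnulib's Hash_hasher
signature (size_t hasher (const void *entry, size_t table_size)).  The
function name, the INITIAL_CANDIDATE constant, and the size threshold are
made up for illustration; a real test would compare against
hash_get_n_buckets (), since hash_initialize may round the requested size
up to a larger prime.]

  #include <stddef.h>
  #include <stdint.h>

  enum { INITIAL_CANDIDATE = 23 };       /* hypothetical initial size */

  /* Well-behaved while the table is small, but collapses everything into
     bucket 0 once the table has grown, forcing the worst-case collision
     discussed above (every old bucket lands in one new bucket).  */
  static size_t
  colliding_hasher (const void *entry, size_t table_size)
  {
    if (table_size <= INITIAL_CANDIDATE)
      return (uintptr_t) entry % table_size;   /* reasonable spread */
    return 0;                                  /* constant after growth */
  }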
> My first patch attempt[1] was to provide cleanup on allocation failure; it
> was only my second version that switched over to the pessimistic version of
> guaranteeing no possibility of allocation failure to avoid the need for
> cleanup in the first place.  The algorithm you describe below rather
> closely resembles my first version (unfortunately, my first attempt is not
> posted in any public repo at the moment).
>
> [1] http://article.gmane.org/gmane.comp.lib.gnulib.bugs/17849
>
>>
>> Right after your initial report, I began rewriting hash_rehash
>> to work as follows:
>>
>>   allocate new table
>>
>>   [ optimization to minimize possibility of unnecessary failure:
>>     in the old table,
>>     liberate any overflow (chained) entries by moving "data"
>>     into empty buckets ]
>
> I hadn't thought of that.  It frees (n_buckets - n_buckets_used) entries up
> front, leaving less pressure for the rest of the algorithm.  But it is
> worthless for my pessimistic algorithm (why? because although you have more
> free entries, you have also increased n_buckets_used by the same amount, so
> you have ended up only wasting time in moving data).  Even for the
> optimistic approach, it means that a failed allocation requires one more
> pass through the original table to undo this effect.  And in the optimal
> case, it makes no difference (with a perfect hasher, there are no overflow
> entries in the first place).  At this point, I'm inclined to say that it
> probably won't help, unless we can come up with a contrived hasher where we
> really do see higher memory pressure mid-move than with either the pre- or
> post-resize table, such that reducing pressure up front would be worthwhile.

I agree.  Just an idea, and probably not worthwhile.

>>   iterate through old table members, hashing them into the new table,
>>
>>   [Possible optimization: process all overflow-entries before
>>    any single-entry bucket.  But that means two passes.
>>    Alternatively: one pass through buckets, but for chains of
>>    length > 1, process all allocated entries before the first,
>>    non-allocated one, which (hence) might require an allocation
>>    in the new table. ]
>
> My first attempt used a full two-pass algorithm for the cleanup case.  With
> one pass, the allocations and frees are intermingled; with two passes, all
> frees occur before any allocations.  However, I have been unable (so far)
> to contrive any such hasher that would show a difference in worst-case
> pressure with the intermingled operation.  So, for better cache locality,
> I'd rather avoid two passes in the common case.  But we definitely want to
> code things to move the overflow first and bucket head last within each
> bucket, as that is a trivial rewrite with obvious benefits.

I agree.  Two-pass sounds expensive.
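[Editorial aside, not part of the exchange: a sketch of the "overflow first,
bucket head last" move order being agreed on here.  The struct hash_entry
layout mirrors gnulib's hash.c (a data pointer plus an overflow link), but
move_bucket, transfer_entry, and recycle_cell are hypothetical names, not
the module's real internals.]

  #include <stdbool.h>

  struct hash_entry { void *data; struct hash_entry *next; };
  struct hash_table;            /* opaque for this sketch */

  /* Hypothetical helpers: transfer_entry may need to allocate a cell in
     the destination table; recycle_cell returns a cell to the free list.  */
  extern bool transfer_entry (struct hash_table *dst, void *data);
  extern void recycle_cell (struct hash_table *src, struct hash_entry *cell);

  /* Move one old bucket: chained (overflow) cells first, each recycled as
     soon as its data has moved, and the bucket head last.  The head cell
     lives inside the old bucket array and is never freed, so moving it
     last means every overflow cell has already been returned to the free
     list before the head's move could possibly need a fresh cell.  */
  static bool
  move_bucket (struct hash_table *old_tab, struct hash_table *new_tab,
               struct hash_entry *head)
  {
    struct hash_entry *cursor = head->next;
    while (cursor)
      {
        struct hash_entry *next = cursor->next;
        if (!transfer_entry (new_tab, cursor->data))
          return false;         /* caller undoes the partial move */
        recycle_cell (old_tab, cursor);
        cursor = next;
      }
    return transfer_entry (new_tab, head->data);
  }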
>> If, in spite of all that, allocation is required, restore to
>> the original table any entries that we've moved into the new one
>> and free the new one and return false.  However, restoring them
>> may be tricky, so we'd have to be careful there, too.
>
> I think I've convinced myself that recovery is always possible without
> further allocation, although I'm still unsure whether there is ever a case
> where the cleanup would require a two-pass algorithm because the one-pass
> approach has higher pressure.
>
>>
>> This approach has the added advantage of decreasing the memory
>> footprint slightly, since we're more likely to reuse existing
>> allocated entries than to allocate new ones that would eventually
>> end up on the free list.
>
> It decreases the memory footprint of the table, but does increase the code
> size.  But that's the price of avoiding worst-case pessimism (and besides,
> a code size increase of even a few k is probably much smaller than the cost
> of pessimistically overallocating, where real-world resizing will probably
> be in the millions of hash table entries before memory pressure is an
> issue).
>
> Do you want me to resurrect the bits of my first patch, and submit
> something that recovers from failure rather than preventing it?

Sure!  Thanks.
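[Editorial aside, not part of the exchange: what the "recover rather than
prevent" approach would mean for a caller, sketched against gnulib's public
hash.h interface (hash_rehash returning bool, hash_get_n_buckets).  The
grow_if_possible wrapper is hypothetical, and the guarantee that a false
return leaves the table untouched is exactly what the patch under
discussion is meant to provide, not a statement about any released
version.]

  #include <stdio.h>
  #include "hash.h"             /* gnulib hash module */

  /* Try to grow TABLE; on allocation failure just keep using the old,
     smaller (and therefore slower, but still valid) table.  */
  static void
  grow_if_possible (Hash_table *table)
  {
    if (!hash_rehash (table, 2 * hash_get_n_buckets (table)))
      fprintf (stderr, "hash_rehash failed; keeping %zu buckets\n",
               hash_get_n_buckets (table));
  }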