Jim Meyering <jim <at> meyering.net> writes: > Long ago (like 12-15 years), in an application that made very heavy use > of hash tables, the obstack code provided a large performance benefit.
Indeed, if you are frequently calling allocate_entry, the cost of malloc'ing lots of small objects is measurably worse than using an obstack to malloc large chunks and carve off entries as needed. In other words, it's not the speedup from freeing multiple entries, so much as the speedup from dealing with collisions when inserting them along with a reduction in memory usage, that makes obstack-managed entry allocation attractive even with modern heap management. For that matter, my patch to hash_rehash penalizes the USE_OBSTACK case by calling obstack_alloc and updating the free_entry_list once per needed free entry, where it could instead use obstack_grow to guarantee enough space without bothering with free_entry_list, so if we spend time improving USE_OBSTACK, I will work on a followup to provide this optimization. > So I'm reluctant to remove it altogether. However, it's been so long > since I last tested it that it may well have suffered from bit rot. > And with modern heap management code, it may not be useful anymore. My code inspection didn't turn up obvious problems, but I have not yet tested with USE_OBSTACK either, so I wouldn't be surprised to find bit rot. If we do get it working, would it be worth making the use of obstacks a runtime decision via hash_tuning, rather than a compile-time decision? There might also be alternative implementations possible with better performance. For example, instead of malloc'ing free entries one at a time and tracking bucket overflows as pointers to malloc'd objects, we could instead malloc/realloc an array of overflows, with bucket overflows tracked as indices into that array. This would provide some of the same benefits as obstack allocation (no per-entry malloc overhead, and fewer calls to malloc and with larger chunks of malloc claimed per call), such that it is no longer necessary to use an obstack for free-entry management in the first place. Another hash implementation I am familiar with treats the slots array as a circular array, and has no overflow linked lists (thus, the entire hash table is contained within a single malloc'd block, rather than the hash.c implementation of one block for bucket heads and other blocks for collisions). Instead, the hasher merely determines where in the circular array to start searching, and on collisions, you end up searching until you hit a free slot. But this has other ramifications; searching for a match means traversing through the array until you encounter an empty slot, and you no longer have the guarantee that the comparison function will always be called with two elements with the same hash. Also, whereas linked lists only search the set of collisions, a nearly full circular array may end up searching the bulk of the table before it arrives at the next free entry. However, I have never benchmarked this implementation to see how it fares when compared with the more traditional bucket of linked lists. We may also want to borrow from Bruno's idioms in the gl_list arena, where hash_initialize is implemented as setting up a vtable optimized to an appropriate implementation, such that you could then experiment with different implementations by swapping the vtable at initialization time. > > I've begun writing a test module, and may even find the time to test > the USE_OBSTACK code, too. As you say, it also needs to be improved > to handle allocation failure, probably the same way remove.c now does: > > http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=a5111af33ea6a5d27 Unrelated to this topic, but in looking at that patch, I noticed you added the following line: cleanup:; Any reason for the empty statement after the label? Is this worth a syntax check in maint.mk? -- Eric Blake