Perhaps I can add a `mode` enum parameter to the FFHashtableContext to control behavior? Then we can benchmark different behaviors on a per-use-case basis.
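To make the idea concrete, a sketch of what such a parameter could look like. This is purely hypothetical: the `FFHashtableMode` name, the mode names, and the existence of such a field are all assumptions for illustration, not the actual FFHashtableContext API.

```c
/* Hypothetical sketch only: none of these names exist in the patch.
 * The idea is to select the collision-handling strategy at context
 * creation time, so each use case can benchmark and pick its own. */
typedef enum FFHashtableMode {
    FF_HASHTABLE_MODE_ROBIN_HOOD,  /* open addressing, Robin Hood probing */
    FF_HASHTABLE_MODE_CHAINED,     /* separate buckets and chained slots  */
} FFHashtableMode;
```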
On Tue, Apr 22, 2025 at 10:24 AM Nicolas George <geo...@nsup.org> wrote:
>
> Emma Worley (HE12025-04-22):
> > Adds a generic hash table with the DXV encoder as an initial use case.
>
> Hi.
>
> This code is already really good. I have some local remarks that will
> not require a lot of work from you and make it better. And I have some
> global remarks that would require a lot of work from you and might
> shave a few cycles. For the rice wine of clarity, I have decided to
> send them separately. This is the mail about the latter.
>
> After writing most of what follows I thought of checking how this data
> structure is used in practice. What I wrote is more relevant as the
> size of the entries grows, but in practice the entries are very small,
> which means the current version is probably close to the best. Still,
> I wrote it, I might as well send it.
>
> As I said, the code is already really good. I do not consider it
> necessary to act on what I will say here, or even to read it. But it
> may contain ideas to do even better, for you or somebody else, for now
> or for later.
>
> This is about hash tables and algorithmics.
>
> Vocabulary as I will use it here:
>
> Hash bucket, or bucket: integer-indexed cell where the hash function
> directs us to start searching for a key.
>
> Entry: key and its associated value, as wanted by the code that uses
> the hash table.
>
> Entry slot, or slot: memory that can store an entry, especially
> relevant if pre-allocated.
>
> We all know the core difficulty with hash tables: different keys can
> map to the same hash value and therefore hash bucket. There are two
> kinds of solutions to this collision issue:
>
> - try to find another hash bucket to find an available entry slot;
>
> - make hash buckets capable of holding multiple entries, typically by
>   means of some kind of linked list.
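As a concrete reference point for the first of these two approaches, open addressing with linear probing can be sketched in a few lines. This is a minimal standalone illustration (fixed-size table, arbitrary hash constant, no deletion, since deletion is exactly what this scheme makes hard), not code from the patch:

```c
#include <stdint.h>

#define TABLE_SIZE 16  /* power of two for cheap modulo */

typedef struct Slot {
    uint32_t key;
    uint32_t val;
    int      used;
} Slot;

static Slot table[TABLE_SIZE];  /* buckets and slots are the same array */

static uint32_t hash(uint32_t key)
{
    /* Knuth-style multiplicative hash; the constant is an arbitrary choice */
    return (key * 2654435761u) & (TABLE_SIZE - 1);
}

/* Insert or update; returns 0 on success, -1 if the table is full. */
static int put(uint32_t key, uint32_t val)
{
    uint32_t h = hash(key);
    for (uint32_t i = 0; i < TABLE_SIZE; i++, h = (h + 1) & (TABLE_SIZE - 1)) {
        if (!table[h].used || table[h].key == key) {
            table[h].key  = key;
            table[h].val  = val;
            table[h].used = 1;
            return 0;
        }
    }
    return -1;
}

/* Probe from the home bucket until the key or a hole is found. */
static int get(uint32_t key, uint32_t *val)
{
    uint32_t h = hash(key);
    for (uint32_t i = 0; i < TABLE_SIZE; i++, h = (h + 1) & (TABLE_SIZE - 1)) {
        if (!table[h].used)
            return -1;          /* a hole ends the probe sequence */
        if (table[h].key == key) {
            *val = table[h].val;
            return 0;
        }
    }
    return -1;
}
```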
>
> In the most C and simplistic implementation, hash buckets and entry
> slots are the same, and the code, common to add, get, del, just tries
> the next bucket until it finds the right key or a hole.
>
> In the most Java/GLib implementation, hash buckets are just pointers
> and each entry is a separately allocated slot. (Though I am sure the
> efficient Java hash tables are written much more in the C style.)
>
> The issue with the second way is that it requires an extra number
> (pointer/index, whatever) per bucket and per entry slot.
>
> The first way has two problems. Clustering ruins the efficiency of
> the table. And deletion becomes hard, because just leaving a hole
> would prevent finding an entry that had to be stored after the hole
> compared to its ideal place.
>
> Clustering can be reduced with a secondary hash function (or even
> more subtle): instead of looking at bucket h+1, look at bucket h+h'.
> Make sure h' and the size of the hash table do not have common
> factors. But that does not solve the issue of deletions.
>
> I was not familiar with this “Robin Hood” algorithm. From what I
> understand, it is a solution to both clustering and deletion.
> Excellent.
>
> Except it costs a number per bucket/slot.
>
> If we consider spending extra memory, we have to consider whether the
> linked list solution might not be more efficient.
>
> This hash table has constant size. That makes it easy to pre-allocate
> all slots in a big array. It can be the same array as the buckets or
> a separate one.
>
> If we keep buckets = slots and if we accept moving an entry on
> occasion, then we can make it work with just one number per
> entry/bucket. When adding a new entry, if the bucket is not empty,
> check if it is a collision: same hash value? If it is a collision,
> then take any empty bucket/slot (keeping them chained is easy) for
> the new entry and link it to the existing one.
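The h+h' trick depends on the step h' being coprime with the table size; with a power-of-two table, forcing the step to be odd is sufficient, and the probe sequence is then guaranteed to visit every bucket. A minimal sketch of that invariant (the hash constant and sizes are arbitrary choices for illustration):

```c
#include <stdint.h>

#define TABLE_SIZE 64  /* power of two */

/* Secondary hash: the "| 1" forces the step to be odd, so it shares no
 * factor with the power-of-two table size and the probe sequence
 * h, h+h', h+2h', ... visits every bucket exactly once. */
static uint32_t probe_step(uint32_t key)
{
    return ((key * 0x9E3779B9u) >> 16) | 1;
}

/* i-th bucket in the probe sequence starting from the home bucket. */
static uint32_t probe(uint32_t home, uint32_t step, uint32_t i)
{
    return (home + i * step) & (TABLE_SIZE - 1);
}
```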
> If it is not a collision, then take the bucket/slot for the new entry
> after moving the entry that was there into an empty bucket.
>
> What I described above is not very nice, but it is the best I know /
> can find that only costs one number per bucket/slot. If we can afford
> to spend more memory on it, we can do better by virtue of breaking
> the buckets = slots equivalence.
>
> It is especially interesting if the entries are large, because this
> version does not require copying entries. Also, we can have more
> buckets than slots, reducing collisions.
>
> All of this is very dependent on specifics. In particular, cache
> locality can make a solution that seems slightly worse actually
> better. It would require extensive benchmarking.
>
> Regards,
>
> --
> Nicolas George
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
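The separate-buckets-and-slots layout described above might look like the following sketch: a pre-allocated slot pool with a free list, more buckets than slots to reduce collisions, one index per bucket plus one per slot, and deletion reduced to a simple unlink. This is an illustration under assumed sizes and names, not the FFmpeg implementation:

```c
#include <stdint.h>

#define NB_BUCKETS 32   /* can exceed NB_SLOTS to reduce collisions */
#define NB_SLOTS   16   /* fixed capacity, pre-allocated up front */
#define NONE       (-1)

typedef struct Slot {
    uint32_t key;
    uint32_t val;
    int      next;      /* next slot in this bucket's chain, or NONE */
} Slot;

static int  buckets[NB_BUCKETS];  /* head slot index per bucket, or NONE */
static Slot slots[NB_SLOTS];      /* the big pre-allocated slot array */
static int  free_head;            /* head of the free-slot chain */

static void ht_init(void)
{
    for (int i = 0; i < NB_BUCKETS; i++)
        buckets[i] = NONE;
    for (int i = 0; i < NB_SLOTS; i++)  /* chain all slots into a free list */
        slots[i].next = i + 1 < NB_SLOTS ? i + 1 : NONE;
    free_head = 0;
}

static uint32_t hash(uint32_t key)
{
    return (key * 2654435761u) % NB_BUCKETS;
}

static int ht_set(uint32_t key, uint32_t val)
{
    uint32_t h = hash(key);
    for (int i = buckets[h]; i != NONE; i = slots[i].next) {
        if (slots[i].key == key) {  /* key present: update in place */
            slots[i].val = val;
            return 0;
        }
    }
    if (free_head == NONE)
        return -1;                  /* all slots in use */
    int i = free_head;              /* pop a slot off the free list */
    free_head = slots[i].next;
    slots[i].key  = key;
    slots[i].val  = val;
    slots[i].next = buckets[h];     /* push onto the bucket's chain */
    buckets[h]    = i;
    return 0;
}

static int ht_get(uint32_t key, uint32_t *val)
{
    for (int i = buckets[hash(key)]; i != NONE; i = slots[i].next) {
        if (slots[i].key == key) {
            *val = slots[i].val;
            return 0;
        }
    }
    return -1;
}

/* Deletion is trivial here: unlink the slot, return it to the free list. */
static int ht_del(uint32_t key)
{
    int *link = &buckets[hash(key)];
    for (int i = *link; i != NONE; i = *link) {
        if (slots[i].key == key) {
            *link = slots[i].next;
            slots[i].next = free_head;
            free_head = i;
            return 0;
        }
        link = &slots[i].next;
    }
    return -1;
}
```

Note that no entry is ever copied after insertion, which is what makes this layout attractive when entries are large.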