Perhaps I can add a `mode` enum parameter to the FFHashtableContext to control behavior? Then we can benchmark different behaviors on a per-use-case basis.
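To make the idea concrete, a sketch of what such a parameter could look like. This is purely hypothetical: the `FFHashtableMode` name, the mode names, and the existence of such a field are all assumptions for illustration, not the actual FFHashtableContext API.

```c
/* Hypothetical sketch only: none of these names exist in the patch.
 * The idea is to select the collision-handling strategy at context
 * creation time, so each use case can benchmark and pick its own. */
typedef enum FFHashtableMode {
    FF_HASHTABLE_MODE_ROBIN_HOOD,  /* open addressing, Robin Hood probing */
    FF_HASHTABLE_MODE_CHAINED,     /* separate buckets and chained slots  */
} FFHashtableMode;
```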
On Tue, Apr 22, 2025 at 10:24 AM Nicolas George <geo...@nsup.org> wrote:
>
> Emma Worley (HE12025-04-22):
> > Adds a generic hash table with the DXV encoder as an initial use case.
>
> Hi.
>
> This code is already really good. I have some local remarks that will
> not require a lot of work from you and make it better. And I have some
> global remarks that would require a lot of work from you and might
> shave a few cycles. For the rice wine of clarity, I have decided to
> send them separately. This is the mail about the latter.
>
> After writing most of what follows I thought of checking how this data
> structure is used in practice. What I wrote is more relevant as the
> size of the entries grows, but in practice the entries are very small,
> which means the current version is probably close to the best. Still,
> I wrote it, I might as well send it.
>
> As I said, the code is already really good. I do not consider it
> necessary to act on what I will say here, or even to read it. But it
> may contain ideas to do even better, for you or somebody else, for now
> or for later.
>
> This is about hash tables and algorithmics.
>
> Vocabulary as I will use it here:
>
> Hash bucket, or bucket: integer-indexed cell where the hash function
> directs us to start searching for a key.
>
> Entry: key and its associated value, as wanted by the code that uses
> the hash table.
>
> Entry slot, or slot: memory that can store an entry, especially
> relevant if pre-allocated.
>
> We all know the core difficulty with hash tables: different keys can
> map to the same hash value and therefore hash bucket. There are two
> kinds of solutions to this collision issue:
>
> - try to find another hash bucket to find an available entry slot;
>
> - make hash buckets capable of holding multiple entries, typically by
>   means of some kind of linked list.
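As a concrete reference point for the first of these two approaches, open addressing with linear probing can be sketched in a few lines. This is a minimal standalone illustration (fixed-size table, arbitrary hash constant, no deletion, since deletion is exactly what this scheme makes hard), not code from the patch:

```c
#include <stdint.h>

#define TABLE_SIZE 16  /* power of two for cheap modulo */

typedef struct Slot {
    uint32_t key;
    uint32_t val;
    int      used;
} Slot;

static Slot table[TABLE_SIZE];  /* buckets and slots are the same array */

static uint32_t hash(uint32_t key)
{
    /* Knuth-style multiplicative hash; the constant is an arbitrary choice */
    return (key * 2654435761u) & (TABLE_SIZE - 1);
}

/* Insert or update; returns 0 on success, -1 if the table is full. */
static int put(uint32_t key, uint32_t val)
{
    uint32_t h = hash(key);
    for (uint32_t i = 0; i < TABLE_SIZE; i++, h = (h + 1) & (TABLE_SIZE - 1)) {
        if (!table[h].used || table[h].key == key) {
            table[h].key  = key;
            table[h].val  = val;
            table[h].used = 1;
            return 0;
        }
    }
    return -1;
}

/* Probe from the home bucket until the key or a hole is found. */
static int get(uint32_t key, uint32_t *val)
{
    uint32_t h = hash(key);
    for (uint32_t i = 0; i < TABLE_SIZE; i++, h = (h + 1) & (TABLE_SIZE - 1)) {
        if (!table[h].used)
            return -1;          /* a hole ends the probe sequence */
        if (table[h].key == key) {
            *val = table[h].val;
            return 0;
        }
    }
    return -1;
}
```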
>
> In the most C and simplistic implementation, hash buckets and entry
> slots are the same, and the code, common to add, get, del, just tries
> the next bucket until it finds the right key or a hole.
>
> In the most Java/GLib implementation, hash buckets are just pointers
> and each entry is a separately allocated slot. (Though I am sure the
> efficient Java hash tables are written much more in the C style.)
>
> The issue with the second way is that it requires an extra number
> (pointer/index, whatever) per bucket and per entry slot.
>
> The first way has two problems. Clustering ruins the efficiency of
> the table. And deletion becomes hard, because just leaving a hole
> would prevent finding an entry that had to be stored after the hole
> compared to its ideal place.
>
> Clustering can be reduced with a secondary hash function (or even
> more subtle): instead of looking at bucket h+1, look at bucket h+h'.
> Make sure h' and the size of the hash table do not have common
> factors. But that does not solve the issue of deletions.
>
> I was not familiar with this “Robin Hood” algorithm. From what I
> understand, it is a solution to both clustering and deletion.
> Excellent.
>
> Except it costs a number per bucket/slot.
>
> If we consider spending extra memory, we have to consider whether the
> linked list solution might not be more efficient.
>
> This hash table has constant size. That makes it easy to pre-allocate
> all slots in a big array. It can be the same array as the buckets or
> a separate one.
>
> If we keep buckets = slots and if we accept moving an entry on
> occasion, then we can make it work with just one number per
> entry/bucket. When adding a new entry, if the bucket is not empty,
> check if it is a collision: same hash value? If it is a collision,
> then take any empty bucket/slot (keeping them chained is easy) for
> the new entry and link it to the existing one.
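The h+h' trick depends on the step h' being coprime with the table size; with a power-of-two table, forcing the step to be odd is sufficient, and the probe sequence is then guaranteed to visit every bucket. A minimal sketch of that invariant (the hash constant and sizes are arbitrary choices for illustration):

```c
#include <stdint.h>

#define TABLE_SIZE 64  /* power of two */

/* Secondary hash: the "| 1" forces the step to be odd, so it shares no
 * factor with the power-of-two table size and the probe sequence
 * h, h+h', h+2h', ... visits every bucket exactly once. */
static uint32_t probe_step(uint32_t key)
{
    return ((key * 0x9E3779B9u) >> 16) | 1;
}

/* i-th bucket in the probe sequence starting from the home bucket. */
static uint32_t probe(uint32_t home, uint32_t step, uint32_t i)
{
    return (home + i * step) & (TABLE_SIZE - 1);
}
```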
> If it is not a collision, then take the bucket/slot for the new entry
> after moving the entry that was there into an empty bucket.
>
> What I described above is not very nice, but it is the best I know /
> can find that only costs one number per bucket/slot. If we can afford
> to spend more memory on it, we can do better by virtue of breaking
> the buckets = slots equivalence.
>
> It is especially interesting if the entries are large, because this
> version does not require copying entries. Also, we can have more
> buckets than slots, reducing collisions.
>
> All of this is very dependent on specifics. In particular, cache
> locality can make a solution that seems slightly worse actually
> better. It would require extensive benchmarking.
>
> Regards,
>
> --
> Nicolas George
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
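The separate-buckets-and-slots layout described above might look like the following sketch: a pre-allocated slot pool with a free list, more buckets than slots to reduce collisions, one index per bucket plus one per slot, and deletion reduced to a simple unlink. This is an illustration under assumed sizes and names, not the FFmpeg implementation:

```c
#include <stdint.h>

#define NB_BUCKETS 32   /* can exceed NB_SLOTS to reduce collisions */
#define NB_SLOTS   16   /* fixed capacity, pre-allocated up front */
#define NONE       (-1)

typedef struct Slot {
    uint32_t key;
    uint32_t val;
    int      next;      /* next slot in this bucket's chain, or NONE */
} Slot;

static int  buckets[NB_BUCKETS];  /* head slot index per bucket, or NONE */
static Slot slots[NB_SLOTS];      /* the big pre-allocated slot array */
static int  free_head;            /* head of the free-slot chain */

static void ht_init(void)
{
    for (int i = 0; i < NB_BUCKETS; i++)
        buckets[i] = NONE;
    for (int i = 0; i < NB_SLOTS; i++)  /* chain all slots into a free list */
        slots[i].next = i + 1 < NB_SLOTS ? i + 1 : NONE;
    free_head = 0;
}

static uint32_t hash(uint32_t key)
{
    return (key * 2654435761u) % NB_BUCKETS;
}

static int ht_set(uint32_t key, uint32_t val)
{
    uint32_t h = hash(key);
    for (int i = buckets[h]; i != NONE; i = slots[i].next) {
        if (slots[i].key == key) {  /* key present: update in place */
            slots[i].val = val;
            return 0;
        }
    }
    if (free_head == NONE)
        return -1;                  /* all slots in use */
    int i = free_head;              /* pop a slot off the free list */
    free_head = slots[i].next;
    slots[i].key  = key;
    slots[i].val  = val;
    slots[i].next = buckets[h];     /* push onto the bucket's chain */
    buckets[h]    = i;
    return 0;
}

static int ht_get(uint32_t key, uint32_t *val)
{
    for (int i = buckets[hash(key)]; i != NONE; i = slots[i].next) {
        if (slots[i].key == key) {
            *val = slots[i].val;
            return 0;
        }
    }
    return -1;
}

/* Deletion is trivial here: unlink the slot, return it to the free list. */
static int ht_del(uint32_t key)
{
    int *link = &buckets[hash(key)];
    for (int i = *link; i != NONE; i = *link) {
        if (slots[i].key == key) {
            *link = slots[i].next;
            slots[i].next = free_head;
            free_head = i;
            return 0;
        }
        link = &slots[i].next;
    }
    return -1;
}
```

Note that no entry is ever copied after insertion, which is what makes this layout attractive when entries are large.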