On Tue, Apr 08, 2025 at 11:10:21AM -0500, Romain Beauxis wrote:
> On Tue, Apr 8, 2025 at 05:20, Michael Niedermayer
> <mich...@niedermayer.cc> wrote:
> >
> > Hi all
> >
> > As i have too many things to do already i did the most logic thing and
> > started thinking about a new and unrelated idea.
> >
> > This is a list of problems and ideas, that everyone is welcome to add to and
> > comment on.
> >
> > AVDictionary is just bad.
> >
> > * its complicated internally with
> >   unneeded alternative (AV_DICT_DONT_STRDUP_VAL/KEY) these are rarely used
> >   and probably not relevant for performance.
> >
> > * all basic operations are as slow as possible.
> >   you want to find, update or remove an entry, search through all entries
> >
> > * its heavy on memory allocations
> >   1 malloc for key, 1 malloc for value, 1 realloc on the AVDictionaryEntry
> >   array
> >   that makes 2+ malloc() for every "foo"="bar"
> >
> > Ideas:
> > 1. put the node struct (AVDictionaryEntry), the key and value in the same
> >    allocated block, 1 malloc() instead of 2.
> >    We can simply concatenate the key and value string, we could even use the
> >    0 terminator instead of the 2nd pointer. Either way the whole
> >    can go to the end of the Node structure for a tree
> > 1b. Now if we did put the key and value together, we can order in the tree
> >     by this combined entity. Why ? because now we have a unique ordering
> >     and also the key+value could be required to be always unique. Simplifying
> >     things from what we have now and making it more replicatable, no
> >     more changes in output because order changed
> > 2. We have a simple AVL tree implementation which we could use to make
> >    all operations O(log n) instead of O(n)
> > 3. We could go with hash tables, splay trees, critbit trees or something
> >    else. hash tables have issues with malicious/odd input which would
> >    require more complexity to workaround.
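To make ideas 1 and 1b concrete, here is a minimal sketch of a node that keeps the key and value in the same allocation as the struct itself. DictNode and the dict_node_* helpers are hypothetical names for illustration only, not existing libavutil API, and the two child pointers are just placeholders for whatever tree would be used.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node layout (assumption, not an FFmpeg type). */
typedef struct DictNode {
    struct DictNode *child[2]; /* placeholder tree links, unused here */
    char data[];               /* "key\0value\0" stored back to back */
} DictNode;

/* One malloc() for the node, the key and the value together. */
static DictNode *dict_node_new(const char *key, const char *value)
{
    size_t klen, vlen;
    DictNode *n;

    assert(key && value);
    klen = strlen(key) + 1;   /* include the 0 terminator */
    vlen = strlen(value) + 1;
    n = malloc(sizeof(*n) + klen + vlen);
    if (!n)
        return NULL;
    n->child[0] = n->child[1] = NULL;
    memcpy(n->data, key, klen);
    memcpy(n->data + klen, value, vlen);
    return n;
}

static const char *dict_node_key(const DictNode *n)
{
    return n->data;
}

/* The key's 0 terminator replaces a second pointer: the value starts
 * right after it. */
static const char *dict_node_value(const DictNode *n)
{
    return n->data + strlen(n->data) + 1;
}

/* Order by key first, then by value, so key+value pairs are unique. */
static int dict_node_cmp(const DictNode *a, const DictNode *b)
{
    int c = strcmp(dict_node_key(a), dict_node_key(b));
    return c ? c : strcmp(dict_node_value(a), dict_node_value(b));
}
```

A comparison function of this shape is what a tree insert would be given; how it plugs into the existing AVL code is left open here.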
> >
> > Of course we could also go a step further and eliminate the malloc per
> > node and put it all in a linear array.
> > As in, insert -> append at the end,
> >        realloc with every power of 2 size increase
> >        complete rebuild once enough elements are removed
> > not sure this isnt overkill for a metadata string dictionary
> >
> > I probably wont have time to implement this in the near future but as i
> > was thinking about this, it seemed to make sense to write this down and
> > post here
> >
> > git grep av_dict | wc is 1436
> >
> > So its used a bit, justifying looking at improving it
> >
> >
> > git grep AV_DICT_DONT_STRDUP | wc is 87
> > git grep AV_DICT_DONT_STRDUP libavutil/ tests doc | wc is 20
> >
> > Seems not too common and one malloc/copy of a string once per metadata entry
> > which is once per file generally, seems a strange optimization to me
>
> Some questions that could be relevant:

[...]
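For the linear array variant quoted above, a rough sketch of the append path could look like this. FlatDict, flat_dict_append and flat_dict_get are made-up names, the initial capacity of 64 bytes is an arbitrary assumption, and removal plus the "complete rebuild" step are left out. The lookup is a plain linear scan just to show the layout, not a proposal for how lookups should work.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical flat storage (assumption, not an FFmpeg type). */
typedef struct FlatDict {
    char  *buf;  /* "key\0value\0key\0value\0..." */
    size_t used; /* bytes in use */
    size_t cap;  /* bytes allocated: 0 or 64 * 2^n */
} FlatDict;

/* insert -> append at the end, realloc only when a power of 2 is crossed */
static int flat_dict_append(FlatDict *d, const char *key, const char *value)
{
    size_t klen = strlen(key) + 1, vlen = strlen(value) + 1;
    size_t need = d->used + klen + vlen;

    if (need > d->cap) {
        size_t cap = d->cap ? d->cap : 64;
        char *p;
        while (cap < need)
            cap *= 2;
        p = realloc(d->buf, cap);
        if (!p)
            return -1;
        d->buf = p;
        d->cap = cap;
    }
    memcpy(d->buf + d->used, key, klen);
    memcpy(d->buf + d->used + klen, value, vlen);
    d->used = need;
    return 0;
}

/* Linear scan over the packed entries; returns NULL if key is absent. */
static const char *flat_dict_get(const FlatDict *d, const char *key)
{
    size_t pos = 0;
    while (pos < d->used) {
        const char *k = d->buf + pos;
        const char *v = k + strlen(k) + 1;
        if (!strcmp(k, key))
            return v;
        pos = (size_t)(v - d->buf) + strlen(v) + 1;
    }
    return NULL;
}
```

With this layout the number of realloc() calls grows with log2 of the total size rather than with the number of entries.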
>
> * Any interest in storing multiple values for the same key? This seems
> like a niche case but, as you pointed out in another thread, typically
> vorbis metadata do allow multiple key/values for the same field.

For a single key, multiple values should not be stored.

You can do
Author1=Eve
Author2=Adam
or
Author=Adam and Eve

But don't do
Author=Eve
Author=Adam

because if you do that and later get an
Author=Lilith
what does that mean? That there is now 1 author, or 3 authors, or 2, and if 2
then which 2?

Or said another way, you can't have multiple identical keys like that AND
allow updates.

>
> * Any interest in storing an optional encoding value for text strings?

The encoding is UTF-8 Unicode.
You can use "Private Use Areas" (PUA) within Unicode if you want to export
characters that the source failed to map to Unicode.

How we do this exactly is up for debate. But it seems more powerful and simpler
for the user than requiring the user app to handle every encoding.

There are 4 potential cases, I think:
1. We are sure what a symbol means and we return only that in Unicode
2. We are sure what a symbol means and we return that in Unicode AND the
   source 8bit char in a PUA
3. We are not sure what a symbol means and we return our best guess in Unicode
   AND the source 8bit char in a PUA
4. We are not sure what a symbol means and we return the source 8bit char in a
   PUA only

If we do that we would need 512 values from a PUA:
2 sets of 256, one to follow up on the last Unicode symbol and one that comes
alone.

> This could be very useful to increase interoperability between legacy
> systems. Typically, a lot of icecast ICY metadata are still passed as
> latin1. This way, the library could pass them unchanged and let the
> system decide what to do with them.

If we do the suggested PUA case above, then a muxer could use either the
standard Unicode or the PUA values.

>
> * Any interest in having alternative value for key names?
> Most metadata systems carry their own naming conventions that are then
> mapped to conventional/normalized names like TIT2 for title in ID3v2
> frames. Having key name aliases could allow the library to refer to
> their own normalized values while allowing a transparent end-to-end
> handling of e.g. ID3v2 where you could dump the exact same frames
> using their native keys.

Maybe the native keys could be attached somehow as extra information.
I am not sure about the complexity though.

Title(TIT2)=...

>
> * Similarly, any interest in carrying a source indicator? One of the
> reasons the recent AV_DICT_DEDUP commit was suggested was to deal with
> the same metadata key coming from two different sources. With a
> source indicator you can let the metadata flow end-to-end and let the
> user make decisions about what to do in these cases.

This feels very similar to the "TIT2" case above.

thx

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Never trust a computer, one day, it may think you are the virus. -- Compn
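As a footnote to the PUA idea (2 sets of 256 code points), a sketch of how a source 8bit char could be carried as a PUA code point in UTF-8. The two bases below are assumptions picked from the BMP Private Use Area (U+E000..U+F8FF); nothing in the thread fixes actual values, and pua_encode_byte / pua_decode are hypothetical helpers, not libavutil functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed bases; the thread does not choose specific code points. */
#define PUA_FOLLOW_BASE 0xE000u /* raw byte following its Unicode symbol */
#define PUA_ALONE_BASE  0xE100u /* raw byte that comes alone */

/* Encode the PUA code point for src as UTF-8 into out.
 * Code points in U+0800..U+FFFF always take 3 bytes. */
static size_t pua_encode_byte(uint8_t src, int alone, uint8_t out[3])
{
    uint32_t cp = (alone ? PUA_ALONE_BASE : PUA_FOLLOW_BASE) + src;

    out[0] = (uint8_t)(0xE0 | (cp >> 12));
    out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (uint8_t)(0x80 | (cp & 0x3F));
    return 3;
}

/* Recover the source byte from a code point in either set,
 * or return -1 if cp is not one of the 512 reserved values. */
static int pua_decode(uint32_t cp)
{
    if (cp >= PUA_FOLLOW_BASE && cp < PUA_FOLLOW_BASE + 512)
        return (int)(cp & 0xFF);
    return -1;
}
```

A muxer that wants the original latin1 bytes back would look for these code points and call something like pua_decode(); one that only wants Unicode would skip them.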
_______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".