> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-boun...@ffmpeg.org> On Behalf Of
> Michael Niedermayer
> Sent: Donnerstag, 17. April 2025 01:41
> To: FFmpeg development discussions and patches <ffmpeg-de...@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 2/3] doc/dict2: Add doc and api change for AVDictionary2
>
> Hi
>
> On Wed, Apr 16, 2025 at 11:15:12PM +0000, softworkz . wrote:
> >
> >
> > > -----Original Message-----
> > > From: ffmpeg-devel <ffmpeg-devel-boun...@ffmpeg.org> On Behalf Of
> > > Michael Niedermayer
> > > Sent: Mittwoch, 16. April 2025 23:48
> > > To: FFmpeg development discussions and patches <ffmpeg-de...@ffmpeg.org>
> > > Subject: Re: [FFmpeg-devel] [PATCH 2/3] doc/dict2: Add doc and api change for AVDictionary2
> > >
> > > Hi softworkz
> > >
> > > I think we should use AI to support us and reduce the workload
> > > on people.
> > > I think this here cost you money
> >
> > This is part of an ongoing research for a project that is totally
> > unrelated to FFmpeg. It wasn't my own money and it wasn't spent
> > in order to create an AvDictionary2 for FFmpeg.
> >
> > Also, I didn't know that you are working on it, you had written
> > that you won't have time. That's why I thought it's a good subject,
>
> Yeah, I say i have no time and then spend time on it anyway ;)
I know that just too well - unfortunately 😊

> maybe thats one of several reasons why i dont have time
> But AVMap surely is/was an interresting project
>
> There are just too many interresting things to work on
> I need more time, the days are too short, life is too short
> and i need an assistent
> also we (FFMpeg) needs someone to
> manage the bug tracker better. In the past carl did that
> (ask people questions when reports where incomplete or unreproduceable
> bisect regressions contact people causing regressions stuff like
> that)
> and i think we should fund carl to do it again. But until we find
> someone funding carl, maybe you can get some AI to do a subset of
> these tasks ?
> also maybe we could train a LLM on the bugtracker data, so that
> we then could just ask it questions about it.

I am no expert on the subject, but from my understanding it doesn't
work like that. When a model is trained on data, the information that
it "learns" needs to be reflected in multiple places in the data to
become "memorable". Singular data - like the individual entries in the
bug tracker - is more like noise that falls off the table. So even if
the trac data were part of the training data, the model wouldn't know
it in a per-ticket way - only recurring information patterns might
stick, or maybe tickets that have been mentioned and/or discussed in
multiple places within the whole body of data.

Anyway, training a model requires millions of dollars for the GPU
clusters needed to compute it. There is "fine-tuning" - a kind of
additional training on top of an existing model - but it has the same
limitations, and it is commonly said that it still needs large amounts
of data to be effective. It still won't remember the trac database,
and fine-tuning is also not something you'd do weekly to keep it up to
date.

What might be suitable for fine-tuning is the mailing-list content
from the past 10 years (user and devel), but it would need to be
pre-processed to exclude mails with patches/code and all e-mails from
the unfriendly members here - that's surely not what you want to teach
a model.

Another option is vector databases. In this case the data doesn't
become part of the model; it is rather a storage that the model can
interact with (if supported). Yet, I don't have the impression that
this is the hottest cow on the field.

More interesting are "embeddings". You pay for tokenizing and
embedding the data you supply - the same operation that happens as a
first step whenever you submit a message. Those embeddings can then be
drawn on in every conversation: the relevant pieces are looked up and
provided to the model much like any other input - they become part of
the conversation - but with an important difference: the whole corpus
doesn't have to fit into the model's context window, which is limited
by its maximum supported token length; only the retrieved pieces do.

Embeddings would be suitable for supplying the FFmpeg source code, all
other kinds of documents, the website content, the wiki on trac and
also instructions regarding the intended behavior, etc. But they are
still not suitable for the bug tracker content. Actually, that is not
something the model needs to "know"; it rather needs to be able to
access it (just like us humans) via an API or browser automation.
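To make that a bit more concrete, here is a rough, untested sketch of
how such an embeddings lookup could work. embed_text() is a
hypothetical stand-in for whatever embedding API would actually be
used (a hosted service or a local model); the rest is plain cosine
similarity over pre-computed vectors.

# Sketch only - embed_text() is a hypothetical stand-in for a real
# embedding API; document names and contents are placeholders.
import numpy as np

def embed_text(text: str) -> np.ndarray:
    """Hypothetical: return the embedding vector for the given text."""
    raise NotImplementedError

def build_index(documents: dict[str, str]) -> dict[str, np.ndarray]:
    # Embed each document (docs, wiki pages, website content, ...)
    # once, up front; this is the part you pay for only once.
    return {name: embed_text(body) for name, body in documents.items()}

def most_relevant(question: str, index: dict[str, np.ndarray], top_n: int = 3):
    # Embed the question and rank the stored documents by cosine
    # similarity, so only a handful of relevant pieces get pasted into
    # the actual conversation.
    q = embed_text(question)
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(index.items(), key=lambda kv: cosine(q, kv[1]), reverse=True)
    return [name for name, _ in ranked[:top_n]]

The point being: only the few retrieved pieces end up in the
conversation, never the whole corpus.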
> the LLM would probably mix and confuse things and hallucinate
> a lot of nonsense.

That's less of a problem nowadays, as the available context windows
have increased, and operating on trac ticket discussions does not
create conversations so long that the context window overflows and
important parts fall off. Some care would only be needed to make sure
it doesn't ingest the really large log outputs that are sometimes
included in tickets.

At this time it would still be too bold to let it work fully
autonomously, but that's not necessary, because its operations could
easily be arbitrated by conventional logic. It could be controlled by
a set of tags - something like:

- tracbot-error
- tracbot-inconclusive
- tracbot-needs-manual-review
- tracbot-awaiting-user-response
- tracbot-reproduced-in-master
- tracbot-fixed-in-master

Then a scheduler service would run over all open issues and invoke the
AI on each of them (see below). The scheduler would exclude tickets
which already have one of those tags assigned. Additionally, it would
include tickets that are tagged with "tracbot-awaiting-user-response"
and have been updated since the tag was assigned.

When the AI is invoked on a ticket, it has clear instructions to
follow. The primary directive is to reproduce the reported issue. If
the provided information is unclear or incomplete, or when no test
file is provided, it posts a message asking for the missing
information and applies the awaiting-user-response tag.

The AI would have an execution environment in a Docker container where
it has access to a library of daily builds from the past 5 years. If
the issue doesn't reproduce with the latest daily build, it adds the
tracbot-fixed-in-master tag. If it can be reproduced with the latest
build, it "bisects" the issue using the daily binaries and adds a
message like "Issue reproducible since version 20xx-xx-xx" together
with the tag tracbot-reproduced-in-master.

If it can't make sense of the ticket, or the issue is
platform-specific, requires certain hardware, or it runs into errors,
it adds one of the other tags.

Some safeguards must be added to avoid anybody getting into a longer
chat with it (it should always end with awaiting-user-response), but
otherwise I don't think there's much that can go wrong. A mailing list
could be set up to which it reports its operations and to which
interested members (or anybody) can subscribe. This would provide a
kind of real-time monitoring by the community.

All in all, I think it's well doable. Unfortunately, though, I cannot
spend that much time on it. Perhaps a candidate for GSoC?

Best,
sw
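PS: To make the "bisect using the daily binaries" step a bit more
concrete, here is a rough, untested sketch. It assumes a local
directory of dated daily builds and a repro command derived from the
ticket that exits non-zero when the issue is present; all paths and
helper names are made up for illustration.

#!/usr/bin/env python3
# Sketch only. Assumes a layout like builds/2020-01-01/ffmpeg,
# builds/2020-01-02/ffmpeg, ... and a repro command (taken from the
# ticket) that exits non-zero when the reported issue is present.
import subprocess
from pathlib import Path

def reproduces(ffmpeg_binary: Path, repro_args: list[str]) -> bool:
    # Run the ticket's repro command against one daily build.
    result = subprocess.run([str(ffmpeg_binary), *repro_args],
                            capture_output=True, timeout=300)
    return result.returncode != 0      # non-zero exit = issue present

def first_bad_build(builds_dir: Path, repro_args: list[str]):
    # Daily builds sorted by date (directory names are ISO dates).
    builds = sorted(p for p in builds_dir.iterdir() if p.is_dir())
    if not reproduces(builds[-1] / "ffmpeg", repro_args):
        return None                    # not reproducible with the latest build
    if reproduces(builds[0] / "ffmpeg", repro_args):
        return builds[0].name          # already broken in the oldest build we have
    lo, hi = 0, len(builds) - 1        # builds[lo] is good, builds[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reproduces(builds[mid] / "ffmpeg", repro_args):
            hi = mid
        else:
            lo = mid
    return builds[hi].name             # first daily build that reproduces

if __name__ == "__main__":
    bad = first_bad_build(Path("builds"),
                          ["-i", "ticket-sample.mkv", "-f", "null", "-"])
    print("not reproducible in master" if bad is None
          else "Issue reproducible since version " + bad)

A real version would of course need the AI to extract the repro
command from the ticket, detect crashes and hangs properly, and run
everything inside the sandboxed container mentioned above.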