[insert meme here] (this will be a long e-mail)
Dear FFmpeg devs,

Over the past few days I've been experimenting with hacking on FFmpeg using Rust. As I am becoming more familiar with libavfilter, and no other libav* library depends on it, I decided it is a good candidate. It's also convenient because I use the FFmpeg libs heavily in a commercial product, and one of the features I've been working on involves basic multi-object tracking. In my case it does not need to be a "perfect" tracking algorithm: I need to trade result quality for performance when running on the CPU only, so most of the algorithms out there that need a GPU are out of my range.

As a first experiment, I then decided to write a filter called `track_sort` that implements the 2016 paper "Simple Online and Realtime Tracking", also known as SORT [1]. The filter already works well on top of the `master` branch, but the code is at a very early stage and far from "production ready", so please do not read it assuming it's in its final form. It's ugly and needs lots of refactoring. I've created a PR on Forgejo [4] to make it easier for others to track progress, although I use gitlab.com as my main forge.

Here is a description of the filter:

- It performs only object tracking; object detection has to be done elsewhere. It feeds on the detection boxes generated by `dnn_detect`. That means the quality of the tracking is closely tied to the quality of the detection.
- SORT is a simple algorithm that uses spatial data only, and it is not able to handle cases such as object occlusion. It's good enough for my use case, as I mentioned earlier.
- The filter works with the default options, so you can use it without any arguments. In this mode, it will try to track any objects from the boxes available. You can change this behaviour by specifying the list of labels to track, for example: `track_sort=labels=person|dog|cat`.
  Such labels come from the ML model you used in the detection filter. It also has the options `threshold`, `min_hits` and `max_age`, which control how the tracking algorithm works, but the default values should work well in most cases.
- The filter adds the tracking information as a label on a new frame side data entry of type `AV_FRAME_DATA_DETECTION_BBOXES`. It **WILL NOT** override the side data from `dnn_detect`, meaning that the frame will have two side data entries of this type. I've created a PR that makes it possible to fetch such an entry [2].
- The labels in the detection boxes have the format "track:<track_num>:<track_age>", and this is not the final format. I did it this way as a quick hack to have some visual information when drawing the boxes and labels with the `drawtext` and `drawbox` filters. I believe this can be improved by putting the tracking information in the metadata of the `AVDetectionBBox`es, but that would break API and ABI, so this is still an open question.

What has not been done so far:

I had quite a few goals in this task:

- 1: get a working and efficient implementation of the SORT algorithm.
- 2: start learning Rust again (it's been ~5 years since I used it)
- 3: learn more about the libavfilter codebase
- 4: evaluate whether Rust could work as a second language for hacking FFmpeg.

Results:

- 1: I managed to reuse lots of high-quality code available on crates.io (the repository of Rust packages), which spared me from writing hairy, math-heavy code. I personally suck at maths, especially linear algebra. Using the paper and the reference implementation [3] was enough, although I do not understand all the math magic. For instance, I reused an existing crate for Kalman filters that I would otherwise probably have had to implement by hand, as the alternative in C would probably be the implementation that OpenCV offers. And I am aware that it's not practical to make OpenCV a dependency of FFmpeg.
- 2: yay! Back to Rust!
- 3: I've learned more not only about avfilter, but a bit about other components as well.
- 4: I have more notes on that later, but it feels to me that Rust is a natural candidate for new code in large C codebases, as it integrates quite well, with some warts. I also have no idea whether the FFmpeg community has discussed Rust in the codebase in the past and, if not, why not now?

Some notes on using Rust:

In general I enjoyed using Rust in the project. If you have a look at the code, you'll notice that I am not reusing any of the nice C macros that make writing new filters a lot easier. That means the Rust code looks like the expanded-macro version of the C code, and that's a lot of boilerplate and ugly code. There were some reasons for that. One is that I am still learning Rust macros and wanted to focus on getting stuff done for now. The second is that Rust has a much more powerful macro system than C does, and avoiding macros for now allows me to feel all the pain of writing the code manually. Such pain, I believe, can help a set of Rust macros to "emerge" from the codebase, rather than someone designing a set of macros that will probably look like the C ones, which might not be "rusty" enough. And I don't find it good practice to design APIs before having some implementation (looking at you, C++ committee).

I've been developing on Manjaro Linux, for now building FFmpeg statically with `--disable-stripping --enable-debug=3 --disable-optimizations` and the Rust code in `Debug` mode. That means slow code and static builds, which are easy to debug and profile.

Debugging is easy, as I can simply use GDB and it just works with the Rust and C code mixed. I still don't have a pretty-printer for the Rust part, but this is probably an issue with my setup.

Profiling also works well. Even though the Rust code is in Debug mode, profiling with Hotspot/perf shows that the tracking code is very efficient (you almost cannot see it in the flamegraph!).

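Part of the reason the tracking step is so cheap is that SORT only looks at box geometry. Here is a minimal sketch of that idea, which is not the code from the PR: it computes the intersection-over-union (IoU) of two boxes and pairs tracks with detections. The `BBox` type and the `iou` and `associate` functions are hypothetical names made up for this illustration, the greedy matching is a simpler stand-in for the Hungarian assignment described in the SORT paper, and treating the threshold as a minimum IoU is an assumption rather than something stated in this e-mail.

```rust
/// Axis-aligned box in pixel coordinates.
/// Hypothetical type for illustration only, not the one used in the PR.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x: f32,
    pub y: f32,
    pub w: f32,
    pub h: f32,
}

/// Intersection over union of two boxes, in the range 0.0..=1.0.
pub fn iou(a: &BBox, b: &BBox) -> f32 {
    let left = a.x.max(b.x);
    let top = a.y.max(b.y);
    let right = (a.x + a.w).min(b.x + b.w);
    let bottom = (a.y + a.h).min(b.y + b.h);
    // Clamp to zero so that disjoint boxes yield an empty intersection.
    let inter = (right - left).max(0.0) * (bottom - top).max(0.0);
    let union = a.w * a.h + b.w * b.h - inter;
    if union > 0.0 {
        inter / union
    } else {
        0.0
    }
}

/// Greedy association (a stand-in for the Hungarian assignment):
/// returns `(track_index, detection_index)` pairs whose IoU is at least
/// `threshold`, taking the best-overlapping pairs first and using each
/// track and each detection at most once.
pub fn associate(tracks: &[BBox], detections: &[BBox], threshold: f32) -> Vec<(usize, usize)> {
    // Score every track/detection pair that passes the threshold.
    let mut candidates: Vec<(f32, usize, usize)> = Vec::new();
    for (t, track) in tracks.iter().enumerate() {
        for (d, detection) in detections.iter().enumerate() {
            let score = iou(track, detection);
            if score >= threshold {
                candidates.push((score, t, d));
            }
        }
    }
    // Highest IoU first.
    candidates.sort_by(|a, b| b.0.total_cmp(&a.0));

    let mut track_used = vec![false; tracks.len()];
    let mut detection_used = vec![false; detections.len()];
    let mut matches = Vec::new();
    for (_, t, d) in candidates {
        if !track_used[t] && !detection_used[d] {
            track_used[t] = true;
            detection_used[d] = true;
            matches.push((t, d));
        }
    }
    matches
}
```

In a full tracker, the boxes in `tracks` would be the positions predicted by the Kalman filter for the current frame; detections left unmatched would start new tracks and tracks left unmatched would age out, which is presumably what options such as `min_hits` and `max_age` govern.
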
Memory management is a breeze, as the standard library has generic versions of many useful containers, such as vectors and B-trees. The algorithms there also make transforming and filtering very convenient and type safe.

You get support for unit tests for free. No hassle, no complex setup. Simply write unit tests anywhere and run them with `cargo test`. It feels very good to get the code to work without being afraid of things going badly (in the code which is not unsafe, of course!).

WARTS

I did not implement any wrapper on top of the avfilter private API (yay `bindgen`!), so it's used directly in the Rust code. That forces you to write `unsafe` code for any interaction with the libav* API. Nevertheless, even in unsafe code, working on non-owned data is very convenient, as you can turn almost anything into slices, which give you lots of convenient algorithms (map, filter, zip, etc.).

Working with C pointers is very painful and ugly, especially `**` and `***`. Rust is very verbose when using them on the Rust side (they become things like `&*mut *mut *mut`, not really easy to reason about). Rust also does not have the `->` operator, forcing you to do stuff like `(*foo).bar`, which is simply ugly.

Interacting with the C API is also not trivial, as in Rust one must be explicit about ownership and lifetimes, something which is done implicitly (and often wrongly) in C.

Struct members in Rust must always be explicitly initialized, even for global static variables, which C implicitly initializes to zero.

C unions. Luckily Rust supports them, but they are always unsafe.

`bindgen` does not generate wrappers for `static av_always_inline blah()` functions, as those are... inlined, so when I needed to use those, I had to simply reimplement them in Rust.

In general my impression is that Rust code is more verbose than C in "dangerous" code, but way less verbose in safe code, due to the compiler checks.

WHY? WHY? WHY?????

OK, why do I, who never really took part in the FFmpeg community, now apparently come throwing Rust in your faces? Am I saying you folks should rewrite FFmpeg in Rust?

I know that the Rust community in particular has recently been involved in a lot of conflicts around large C codebases, and it's not my intention to tell you what to do or not to do. I recognize that I have no authority in this group for that; I am essentially just an FFmpeg user.

My intention, first of all, was to get some stuff I needed done. I'm working on a commercial product, and developing in Rust was the quickest way I could get it done (considering my requirements). I've enjoyed working on this project a lot, and I believe what I've learned can be useful for the FFmpeg community as a whole.

Demo time

Requirements: Cargo/Rust installed. I am using `1.84.0`, the latest stable, via `rustup`. You'll also need OpenVINO, HarfBuzz and FreeType installed.

First of all, check out the code from the PR at [4] and compile FFmpeg with:

```sh
./configure --disable-stripping --enable-debug=3 --disable-optimizations --enable-libopenvino --enable-libharfbuzz --enable-libfreetype --enable-openssl
cargo build && make
```

I added a `--enable-rust` flag in the PR, but at the moment it does nothing :-)

Next, download a pre-trained YOLOv4 model and its associated files to perform the object detection:

```sh
pip install openvino-dev tensorflow
omz_downloader --name yolo-v4-tiny-tf
omz_converter --name yolo-v4-tiny-tf
wget https://raw.githubusercontent.com/openvinotoolkit/open_model_zoo/refs/heads/master/data/dataset_classes/coco_80cl.txt
```

Here we'll use a video from MOT Challenge 2016 [5], which is the one shown in the original SORT paper.
You can use it with the command:

```sh
./ffplay https://motchallenge.net/sequenceVideos/MOT16-06-raw.webm -vf 'dnn_detect=dnn_backend=openvino:model=public/yolo-v4-tiny-tf/FP16/yolo-v4-tiny-tf.xml:input=image_input:confidence=0.1:model_type=yolov4:anchors=81&82&135&169&344&319:labels=coco_80cl.txt:async=0:nb_classes=80,track_sort=labels=person,drawbox=box_source=side_data_detection_bboxes:color=red:skip=1,drawtext=text_source=side_data_detection_bboxes:fontcolor=yellow:bordercolor=yellow:fontsize=20:fontfile=DroidSans-Bold.ttf:skip=1'
```

The `dnn_detect` options were obtained from the YOLOv4 model description at [6]. Please also note that I passed the extra option `skip=1` to both the `drawtext` and the `drawbox` filters. This makes them render the box information from `track_sort` instead of the one from `dnn_detect`. More at [2].

I also recorded a video showing the filter in action [7].

Cheers,
Leandro

[1] https://arxiv.org/pdf/1703.07402
[2] https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/10
[3] https://github.com/abewley/sort
[4] https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/11
[5] https://motchallenge.net/vis/MOT16-06
[6] https://github.com/openvinotoolkit/open_model_zoo/blob/master/models/public/yolo-v4-tiny-tf/README.md
[7] https://youtu.be/U_y4-NnaINg

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel