[insert meme here]

(this will be a long e-mail)

Dear FFmpeg devs,

in the past few days I've been experimenting with hacking on FFmpeg using Rust.

As I am becoming more familiar with libavfilter, and it is not a dependency of 
any of the other libav* libs, I decided it is a good candidate.

It's also convenient as I use the FFmpeg libs heavily in a commercial product, 
and one of the features I've been working on involves basic multi-object 
tracking.

In my case, it does not need to be a "perfect" tracking algorithm, as I need to 
trade some quality of the result for performance when running on the CPU only, 
so most of the algorithms out there that need a GPU are out of my reach.

I then decided to use, as a first experiment, a filter called `track_sort` that 
implements the 2016 paper SIMPLE ONLINE AND REALTIME TRACKING, also known as 
SORT [1].

The filter already works well based on the `master` branch, but the code itself 
is in very early stages and far from being "production ready", so please do not 
read the code assuming it's in its final form. It's ugly and needs lots of 
refactoring.

I've created a PR on forgejo [4] to make it easier for others to track 
progress, although I use gitlab.com as my main forge.

Here is a description of the filter:

- It performs only object tracking, needing the object detection to be 
performed elsewhere. It feeds from the detection boxes generated by 
`dnn_detect`. That means that the quality of the tracking is closely related to 
the quality of the detection.

- SORT is a simple algorithm that uses spatial data only, and it is not able to 
handle cases such as object occlusion. It's good enough for my use case, as I 
mentioned earlier.

- The filter works with the default options, so you can pass it without any 
arguments. In this mode, it will try to track any objects from the boxes 
available. You can change this behaviour by specifying the list of labels to 
track, for example: `track_sort=labels=person|dog|cat`. Such labels come from 
the ML model you used in the detection filter. It also has the options 
`threshold`, `min_hits` and `max_age`, which control how the tracking algorithm 
works, but the default values should work well in most cases.

- The filter will add the tracking information as a label in a new frame side 
data entry of type `AV_FRAME_DATA_DETECTION_BBOXES`. It **WILL NOT** override 
the side data from `dnn_detect`, meaning that the frame will have two side data 
entries of this type. I've created a PR that makes it possible to fetch such an 
entry [2].

- The labels in the detection boxes have the format 
"track:<track_num>:<track_age>", and this is not the final format. I did it 
this way as a quick hack to have some visual information when drawing the boxes 
and labels with the `drawtext` and `drawbox` filters. I believe this can be 
improved by putting the tracking information as metadata in the 
`AVDetectionBBox`es, but this would be an API and ABI break, so this is still 
an open question.
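
As an illustration of the interim label format described above, here is a 
minimal Rust sketch that formats and parses "track:<track_num>:<track_age>" 
strings. The `TrackLabel` type, its field types and its method names are 
assumptions made up for this example; they are not taken from the filter's code.

```rust
/// Hypothetical holder for the two values carried in the interim label.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TrackLabel {
    pub track_num: u32,
    pub track_age: u32,
}

impl TrackLabel {
    /// Render the label as "track:<track_num>:<track_age>".
    pub fn to_label(&self) -> String {
        format!("track:{}:{}", self.track_num, self.track_age)
    }

    /// Parse a label back; returns `None` for anything not in that format,
    /// e.g. a plain class label such as "person" coming from `dnn_detect`.
    pub fn parse(label: &str) -> Option<Self> {
        let mut parts = label.split(':');
        if parts.next()? != "track" {
            return None;
        }
        let track_num = parts.next()?.parse().ok()?;
        let track_age = parts.next()?.parse().ok()?;
        // Reject trailing fields such as "track:1:2:3".
        if parts.next().is_some() {
            return None;
        }
        Some(Self { track_num, track_age })
    }
}
```

A consumer that only sees the label string could use `TrackLabel::parse` to 
tell tracking boxes apart from plain detection boxes.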

What has not been done so far:

I had quite a few goals in this task:

- 1: get a working and efficient implementation of the SORT algorithm.
- 2: start learning Rust again (it's been ~5 years since I used it)
- 3: learn more about the libavfilter codebase
- 4: evaluate whether Rust could work as a second language for hacking FFmpeg.

Results:

- 1: I managed to reuse lots of high-quality code available on crates.io (the 
repository of Rust packages), sparing me from writing hairy, math-heavy code. I 
personally suck at maths, especially linear algebra. Using the paper and the 
reference implementation [3] was enough, although I do not understand all the 
math magic. For instance, I reused an existing crate for Kalman filters that I 
would otherwise probably have had to implement by hand, as the alternative in C 
would probably be the implementation that OpenCV offers. And I am aware that 
it's not practical to make OpenCV a dependency of FFmpeg.

- 2: yay! Back to Rust!

- 3: I've learned more not only about avfilter, but a bit about other 
components as well.

- 4: I have more notes on that later, but it feels to me that Rust is a natural 
candidate for new code in large C codebases, as it integrates quite well, with 
some warts. I also have no idea whether the FFmpeg community has discussed Rust 
in the codebase in the past and, if not, why not now?

Some notes on using Rust:

In general I enjoyed using Rust in the project, and if you have a look at the 
code, you'll notice that I am not reusing any of the nice C macros that make a 
lot of stuff easier when writing new filters. That means that the Rust code 
looks like the expanded macro versions from C. And that's a lot of boilerplate 
and ugly code.

There were some reasons for that. One is that I am still learning Rust macros, 
and wanted to focus on getting stuff done for now. The second is that Rust has 
a much more powerful macro system than C does, and avoiding macros now allows 
me to feel all the pain of writing the manual code. Such pain, I believe, can 
help a set of Rust macros to "emerge" from the codebase, rather than designing 
up front a set of macros that will probably look like the C ones, which might 
not be "rusty" enough. And I don't find it good practice to design APIs before 
having some implementation (looking at you, C++ committee).

I've been developing on Manjaro Linux and for now building FFmpeg statically 
with `--disable-stripping --enable-debug=3 --disable-optimizations` and the 
Rust code in `Debug` mode. That means slow code and static builds, which are 
easy to debug and profile.

Debugging is easy, as I can simply use GDB and it just works with the Rust and 
C code mixed. I still don't have a pretty-printer for the Rust part, but this 
is probably an issue with my setup.

Profiling also works well. Even though the Rust code is in Debug mode, 
profiling with Hotspot/Perf shows that the tracking code is very efficient (you 
almost cannot see it in the flamegraph!).

Memory management is a breeze, as the standard library has generic versions of 
many useful containers, such as Vectors and BTrees. The algorithms there also 
make transforming and filtering very convenient and type safe.
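
As a small illustration of the containers and iterator adapters mentioned 
above, here is a sketch of how a `labels=person|dog|cat` style option could be 
turned into a set and used to filter detections. The `Detection` type and both 
function names are assumptions for this example, not the filter's actual code.

```rust
use std::collections::BTreeSet;

/// Simplified stand-in for a detection box; not FFmpeg's `AVDetectionBBox`.
#[derive(Debug, Clone, PartialEq)]
pub struct Detection {
    pub label: String,
    pub confidence: f32,
}

/// Split an option value such as "person|dog|cat" into a set of labels.
pub fn parse_labels(option: &str) -> BTreeSet<String> {
    option
        .split('|')
        .filter(|s| !s.is_empty())
        .map(str::to_owned)
        .collect()
}

/// Keep only detections whose label is wanted. An empty set keeps everything,
/// mirroring the "track any object by default" behaviour described earlier.
pub fn filter_by_label<'a>(
    dets: &'a [Detection],
    wanted: &BTreeSet<String>,
) -> Vec<&'a Detection> {
    dets.iter()
        .filter(|d| wanted.is_empty() || wanted.contains(&d.label))
        .collect()
}
```

The returned vector only borrows from the input slice, so nothing is copied and 
the compiler checks that the detections outlive the filtered view.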

You get support for unit tests for free. No hassle, no complex setup. Simply 
write unit tests anywhere and run them with `cargo test`.
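
To show how little ceremony this needs, here is a minimal sketch: an 
intersection-over-union (IoU) helper, the overlap measure SORT uses to match 
detections to tracks, with its tests living in the same file. The `BBox` type 
and the `iou` function are assumptions for this example, not the filter's 
actual code.

```rust
/// Axis-aligned box given by its top-left and bottom-right corners.
#[derive(Debug, Clone, Copy)]
pub struct BBox {
    pub x1: f32,
    pub y1: f32,
    pub x2: f32,
    pub y2: f32,
}

/// Intersection over union of two boxes, in the range [0, 1].
pub fn iou(a: &BBox, b: &BBox) -> f32 {
    // Width and height of the overlap, clamped to zero for disjoint boxes.
    let w = (a.x2.min(b.x2) - a.x1.max(b.x1)).max(0.0);
    let h = (a.y2.min(b.y2) - a.y1.max(b.y1)).max(0.0);
    let inter = w * h;
    let area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
    let area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
    let union = area_a + area_b - inter;
    if union > 0.0 { inter / union } else { 0.0 }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn identical_boxes_overlap_fully() {
        let b = BBox { x1: 0.0, y1: 0.0, x2: 10.0, y2: 10.0 };
        assert!((iou(&b, &b) - 1.0).abs() < 1e-6);
    }

    #[test]
    fn disjoint_boxes_do_not_overlap() {
        let a = BBox { x1: 0.0, y1: 0.0, x2: 1.0, y2: 1.0 };
        let b = BBox { x1: 5.0, y1: 5.0, x2: 6.0, y2: 6.0 };
        assert_eq!(iou(&a, &b), 0.0);
    }
}
```

The `tests` module is only compiled when running `cargo test`, so it adds 
nothing to the normal build.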

It feels very good to get the code to work and not be afraid of things going 
badly (in the code which is not unsafe, of course!).

WARTS

I did not implement any wrapper on top of the avfilter private API (yay 
`bindgen`!), so it's used directly in the Rust code. It forces you to write the 
code as `unsafe` for any interaction with the libav* API. Nevertheless, even in 
unsafe code, working on non-owned data is very convenient, as you can turn 
almost anything into slices, which provide you with lots of convenient 
algorithms (map, filter, zip, etc.).
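
A minimal sketch of the slice trick, assuming a C array handed over as a 
pointer plus a length. `CBox` and `total_area` are made-up names for this 
example and do not correspond to any FFmpeg type or function; only 
`std::slice::from_raw_parts` is the real standard library call.

```rust
use std::slice;

/// Stand-in for a C struct as `bindgen` might emit it; not a real FFmpeg type.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
pub struct CBox {
    pub x: i32,
    pub y: i32,
    pub w: i32,
    pub h: i32,
}

/// Sum the areas of `n` boxes stored in a C array, skipping degenerate ones.
///
/// # Safety
/// `boxes` must be null or point to `n` valid, initialized `CBox` values that
/// stay alive and are not mutated for the duration of the call.
pub unsafe fn total_area(boxes: *const CBox, n: usize) -> i64 {
    if boxes.is_null() || n == 0 {
        return 0;
    }
    // Borrow the C array as a slice; no copy is made.
    let view: &[CBox] = unsafe { slice::from_raw_parts(boxes, n) };
    view.iter()
        .filter(|b| b.w > 0 && b.h > 0)
        .map(|b| i64::from(b.w) * i64::from(b.h))
        .sum()
}
```

Once the slice exists, everything after the single `unsafe` line is ordinary 
safe iterator code.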

Working with C pointers is very painful and ugly, especially `**` and `***`. 
Rust is very verbose when using them on the Rust side (they become things like 
`&*mut *mut *mut`, not really easy to reason about). Rust also does not have 
the `->` operator, forcing you to do stuff like `(*foo).bar`, which is simply 
ugly.
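
For readers who have not seen it, this is roughly what reading a field through 
a C-style `Node **` looks like. `Node` and both functions are made-up names for 
this sketch; the second variant uses the standard raw-pointer method `as_ref`, 
which some may find easier to read.

```rust
/// Stand-in for a C struct; the names are made up for this example.
#[repr(C)]
pub struct Node {
    pub value: i32,
}

/// Read `value` through a `Node **`, the equivalent of C's `(*pp)->value`.
///
/// # Safety
/// `pp` must be null or point to a pointer that is itself null or valid.
pub unsafe fn read_value(pp: *mut *mut Node) -> Option<i32> {
    unsafe {
        if pp.is_null() || (*pp).is_null() {
            return None;
        }
        // No `->` in Rust: dereference explicitly, then access the field.
        Some((**pp).value)
    }
}

/// Same access using `as_ref`, which turns a raw pointer into `Option<&T>`.
///
/// # Safety
/// Same requirements as `read_value`.
pub unsafe fn read_value_as_ref(pp: *mut *mut Node) -> Option<i32> {
    unsafe {
        let p: *mut Node = *pp.as_ref()?;
        p.as_ref().map(|n| n.value)
    }
}
```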

Interacting with the C API is also not trivial, as in Rust one must be explicit 
about ownership and lifetimes, something which is done implicitly (and often 
wrongly) in C.

Struct members in Rust must always be explicitly initialized, even for global 
static variables, which C initializes with zero implicitly.
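
One way to soften this, sketched under the assumption of a plain-data struct: 
spell out an all-zero constant once and fill the remaining fields from it with 
struct update syntax. `TrackOptions` and the values below are made up for this 
example; only the option names come from the filter description above.

```rust
/// Stand-in for a C options struct; not a real FFmpeg type.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TrackOptions {
    pub threshold: f32,
    pub min_hits: u32,
    pub max_age: u32,
    pub flags: u32,
}

impl TrackOptions {
    /// Every field spelled out once; this plays the role of C's implicit
    /// zero initialization.
    pub const ZEROED: Self = Self {
        threshold: 0.0,
        min_hits: 0,
        max_age: 0,
        flags: 0,
    };
}

/// C would accept `static TrackOptions opts = { .max_age = 1 };`.
/// In Rust the other fields must come from somewhere, here from `ZEROED`.
pub static DEFAULT_OPTIONS: TrackOptions = TrackOptions {
    max_age: 1,
    ..TrackOptions::ZEROED
};
```

`std::mem::zeroed()` is another route, but it is `unsafe` and only sound for 
types where the all-zero bit pattern is a valid value.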

C unions. Luckily Rust supports them, but they are always unsafe.
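
A minimal sketch of what that looks like, using a made-up tagged union 
(`Value`, `Tagged` and `as_f64` are assumptions for this example, not FFmpeg 
types). Writing a union field is safe; reading one needs `unsafe`.

```rust
/// Stand-in for a C union; the names are made up for this example.
#[repr(C)]
#[derive(Clone, Copy)]
pub union Value {
    pub int: i64,
    pub dbl: f64,
}

/// The usual C pattern: a tag next to the union says which field is live.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct Tagged {
    pub is_double: bool,
    pub value: Value,
}

/// Accessor that hides the unsafe read behind the tag check.
pub fn as_f64(t: &Tagged) -> f64 {
    // Reading a union field is unsafe because the compiler cannot know
    // which field was written last.
    unsafe {
        if t.is_double {
            t.value.dbl
        } else {
            t.value.int as f64
        }
    }
}
```

Wrapping the read in a small accessor like this keeps the `unsafe` in one 
place instead of at every use site.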

`bindgen` does not generate wrappers for `static av_always_inline blah()` 
functions, as those are... inlined, so when I needed those, I simply had to 
reimplement them in Rust.
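
As an illustration of such a port (which functions were actually reimplemented 
is not stated here, so this one is picked purely as an example), this is what 
the logic of libavutil's inline `av_clip_c()` looks like in Rust:

```rust
use std::os::raw::c_int;

/// Rust port of the logic of libavutil's inline `av_clip_c(a, amin, amax)`:
/// clip `a` into the range [amin, amax].
#[inline(always)]
pub fn av_clip(a: c_int, amin: c_int, amax: c_int) -> c_int {
    if a < amin {
        amin
    } else if a > amax {
        amax
    } else {
        a
    }
}
```

The standard library's `i32::clamp` does the same for valid ranges, but it 
panics when `amin > amax`, so the hand-written version stays closer to the C 
behaviour.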

In general my impression is that Rust code is more verbose than C in 
"dangerous" code, but way less verbose in safe code, due to the compiler checks.

WHY? WHY? WHY?????

Ok, why do I, who never really took part in the FFmpeg community, now 
apparently come throwing Rust in your faces? Am I saying you folks should 
rewrite FFmpeg in Rust? I know that the Rust community in particular has 
recently been involved in a lot of conflicts around large C codebases, and it's 
not my intention to tell you what to do or not to do. I recognize that I have 
no authority in this group for that, and I am essentially just an FFmpeg user.

My intention, first of all, was to get some stuff I needed done. I'm working on 
a commercial product, and developing in Rust was the quickest way I could get 
it done (considering my requirements). I've enjoyed working on this project a 
lot, and I believe what I learned can be useful for the FFmpeg community as a 
whole.

Demo time

Requirements: Cargo/Rust installed. I am using `1.84.0`, the latest stable, via 
`rustup`.

You'll need openvino, harfbuzz and freetype installed.

First of all, check out the code from the PR at [4] and compile FFmpeg with:

```sh
./configure --disable-stripping --enable-debug=3 \
    --disable-optimizations --enable-libopenvino --enable-libharfbuzz \
    --enable-libfreetype --enable-openssl
cargo build && make
```

I added a `--enable-rust` flag to the PR, but at the moment it does nothing :-)

Next you should download a pre-trained YOLOv4 model and associated files, to 
perform the object detection:

```sh
pip install openvino-dev tensorflow
omz_downloader --name yolo-v4-tiny-tf
omz_converter --name yolo-v4-tiny-tf
wget https://raw.githubusercontent.com/openvinotoolkit/open_model_zoo/refs/heads/master/data/dataset_classes/coco_80cl.txt
```

Here we'll use a video from MOT Challenge 2016 [5], which is the one shown in 
the original SORT paper. You can play it with the command:

```sh
./ffplay https://motchallenge.net/sequenceVideos/MOT16-06-raw.webm -vf \
'dnn_detect=dnn_backend=openvino:model=public/yolo-v4-tiny-tf/FP16/yolo-v4-tiny-tf.xml:input=image_input:confidence=0.1:model_type=yolov4:anchors=81&82&135&169&344&319:labels=coco_80cl.txt:async=0:nb_classes=80,track_sort=labels=person,drawbox=box_source=side_data_detection_bboxes:color=red:skip=1,drawtext=text_source=side_data_detection_bboxes:fontcolor=yellow:bordercolor=yellow:fontsize=20:fontfile=DroidSans-Bold.ttf:skip=1'
```

The `dnn_detect` options were obtained from the YOLOv4 model description at [6].

Please also note that I passed the extra option `skip=1` to both the `drawtext` 
and the `drawbox` filters. This makes them render the box information from 
`track_sort` instead of the one from `dnn_detect`. More at [2].

I also recorded a video showing the filter in action [7].

Cheers,

Leandro


[1] https://arxiv.org/pdf/1602.00763
[2] https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/10
[3] https://github.com/abewley/sort
[4] https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/11
[5] https://motchallenge.net/vis/MOT16-06
[6] https://github.com/openvinotoolkit/open_model_zoo/blob/master/models/public/yolo-v4-tiny-tf/README.md
[7] https://youtu.be/U_y4-NnaINg
