[FFmpeg-devel] [Query] Issues with FFMPEG 7.1 and HW plugins

Kamboj, Nitin via ffmpeg-devel Sun, 11 May 2025 06:38:55 -0700

[AMD Official Use Only - AMD Internal Distribution Only]

Hi,


This mail describes an issue we hit while implementing a zero-copy,
hardware-accelerated transcoding pipeline with FFmpeg, followed by the
workaround we tried. The pipeline uses hardware acceleration to decode,
scale, overlay, and encode video streams.
The problem appeared when we moved to FFmpeg 7.1: the pipeline stalls
because of synchronization issues between the multiple video streams fed
into a filter graph. Below we document the underlying problem, the fix we
attempted, and the open questions we need answered to refine the approach.

Pipeline Workflow:
    -Two video sources are decoded into separate 360p and 4K streams.
    -The 360p stream is scaled up using a hardware-accelerated scaler.
    -Both streams are then combined using a hardware-accelerated overlay.
    -The final overlay output is encoded into a final stream.

Usecase:-
        ffmpeg -hwaccel ... -i 360p60.264 -i 4kp60.264 -filter_complex \
          "[IN1]scale_hw[TMP1];[IN2][TMP1]overlay_hw[OUT]" -map "[OUT]" ...



                                                 FILTERGRAPH
                                        +----------------------------+
+-------------+                         |    +--------+ 480p         |
|Decoder(360p)+----+                +-->+--->| Scaler +------+       |
+-------------+    |                |   |    +--------+      |       |
                   |    +-------+   |   |                    |       |
                   +--->|Queue  +---+   |                    |       |
                   |    +-------+   |   |                    v       |
+-------------+    |                |   |               +---------+  |     +---------+
|Decoder(4k)  +----+                +-->+-------------->| Overlay +--+---->| Encoder |
+-------------+                         |               +---------+  |     +---------+
                                        +----------------------------+




Issue:-
        The HW-accelerated use case above works fine on ffmpeg 6.x but stalls
        on ffmpeg n7.1 (after migration).
        With all-SW plugins the same use case also works fine on n7.1; only
        the HW-accelerated plugins (for decode, scale, overlay and encode)
        cause the pipeline to stall.

Limitations:-
        All the HW-accelerated plugins have small, fixed-size output frame
        pools (allocated at init). HW memory is limited, so these pools must
        be kept as small as possible. A plugin that runs out of frames in
        its frame pool waits until something downstream consumes and frees
        a frame it has already sent out.

Cause:-
        As per our detailed analysis and current understanding of the
        ffmpeg 7.1 application changes (multithreading support), the issue
        is that both decoded outputs are fed to the filter graph through a
        common thread queue.
        Since the two decoders are independent and run on separate threads,
        decoder(360p) can generate many frames before decoder(4K) generates
        a single one. The overlay plugin needs at least one frame on both
        inputs to proceed, so many HW frames get buffered on one input of
        the overlay filter (scale->overlay). We then run out of free HW
        frames and the pipeline stalls.

Detailed Explanation:-
        1. Frame Pools Involved.
            a. Decoder(360p) out pool (frames consumed/freed by scaler)
            b. Decoder(4k) out pool (frames consumed/freed by overlay,
                when both inputs are available and an output is ready)
            c. Scaler out pool (frames consumed/freed by overlay, when
                both inputs are available and an output is ready)
            d. Overlay out pool (frames consumed by encoder)
        2. Execution model of scaler_hw plugin.
            The plugin runs completely on the filtergraph's thread. Assume the
            out pool has size N. On receiving an input frame, the plugin first
            tries to allocate a frame from the out pool.
            If no frame is available, the filtergraph's thread blocks, which
            means the overlay filter never gets a chance to run and consume
            a frame, and we have a deadlock. Thus we must ensure the scaler
            is never more than N frames ahead of the overlay.
        3. Filtergraph Execution model.
            All the filters in a filtergraph run on the same thread.
            Every filterlink has an unbounded queue to buffer its
            inputs. For multi-input filters, activate is only called
            when all inputs are available.
        4. FFmpeg application scheduler (ffmpeg_sched.c)
            A filtergraph with multiple inputs has a single queue
            into which all of its decoders write their frames.
            Even though the filtergraph has a concept of a best input, only
            the top entry of the queue is fed to the filtergraph, which makes
            the best input a mere suggestion and not a binding request. This
            means a faster demuxer/decoder combo can flood one input of the
            filtergraph, and the pipeline stalls as we run out of HW
            frames.
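
As a rough illustration of points 2 to 4, here is a toy Python model of the
stall. It is not FFmpeg code: the function name, the "360p"/"4k" tags and the
pool size are assumptions made only for this sketch, and the decoder pools and
real threading are left out.

```python
from collections import deque

def run_shared_queue(arrivals, scaler_pool_size):
    """Feed frames to a toy filtergraph strictly in arrival order.

    arrivals: "360p" / "4k" tags in the order the two decoders wrote
    them into the single shared queue.
    Returns (overlay_outputs, stalled).
    """
    free_scaler_frames = scaler_pool_size
    scaled = deque()   # scaler outputs buffered on the scale->overlay link
    main = deque()     # 4K frames buffered on the other overlay input
    outputs = 0

    for tag in arrivals:
        if tag == "360p":
            if free_scaler_frames == 0:
                # The scaler would block the only filtergraph thread here,
                # so the overlay can never run and free a frame.
                return outputs, True
            free_scaler_frames -= 1
            scaled.append(tag)
        else:
            main.append(tag)

        # The overlay runs only when both of its inputs hold a frame.
        while scaled and main:
            scaled.popleft()
            main.popleft()
            free_scaler_frames += 1   # scaler frame goes back to its pool
            outputs += 1

    return outputs, False

# Well-interleaved arrivals: at most one scaler frame is outstanding.
print(run_shared_queue(["360p", "4k"] * 4, scaler_pool_size=2))          # (4, False)
# Three 360p frames ahead of the first 4K frame exhaust a pool of 2.
print(run_shared_queue(["360p"] * 3 + ["4k"] * 3, scaler_pool_size=2))   # (0, True)
```

In this model the second case only completes if the pool is at least as large
as the burst, which is exactly what limited HW memory does not allow.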

Workarounds tried:-
        Modify ffmpeg_sched.c to use multiple queues (1 for every filtergraph
        input) and always feed the best requested input to filtergraph.
        (Diff versus n7.1 attached)
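
The idea behind the workaround can be sketched with a toy Python model. This
is not the attached patch: the names are hypothetical, "best input" is
approximated as the overlay input that currently holds fewer frames, and all
decoder output is assumed to be already waiting in the per-input queues.

```python
from collections import deque

def run_best_input(arrivals, scaler_pool_size):
    """Toy model: one queue per filtergraph input, feed only the requested one.

    arrivals: "360p" / "4k" tags in the order the decoders produced them.
    Returns the number of overlay outputs.
    """
    pending = {"360p": deque(), "4k": deque()}   # per-input scheduler queues
    for tag in arrivals:
        pending[tag].append(tag)

    free_scaler_frames = scaler_pool_size
    scaled = 0    # frames buffered on the scale->overlay link
    main = 0      # frames buffered on the 4K->overlay link
    outputs = 0

    while True:
        # Assumed "best input": the overlay input holding fewer frames.
        want = "360p" if scaled <= main else "4k"
        if not pending[want]:
            break   # a real scheduler would wait for that decoder here
        pending[want].popleft()
        if want == "360p":
            # With strict best-input feeding the pool is never exhausted.
            assert free_scaler_frames > 0
            free_scaler_frames -= 1
            scaled += 1
        else:
            main += 1
        if scaled and main:   # overlay has a frame on both inputs
            scaled -= 1
            main -= 1
            free_scaler_frames += 1
            outputs += 1
    return outputs

# Even with a burst of 360p frames, a scaler pool of one frame is enough.
print(run_best_input(["360p"] * 3 + ["4k"] * 3, scaler_pool_size=1))   # 3
```

Because the graph is only ever handed the input it asked for, the number of
outstanding scaler frames in this model never exceeds one, regardless of how
far ahead the 360p decoder runs.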

        Side-Effect
            This works in most cases but fails when a single file with
            multiple streams feeds separate decoders and a filter inside
            the filtergraph then tries to combine those two inputs.
            If the streams inside the file are not properly interleaved, such
            a use case causes a circular deadlock with this workaround.
            Specifically, the FATE test filter-overlay-dvdsub-2397 hangs.
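
The circular wait can be reproduced in a toy Python model. The sketch assumes
one demuxer, two streams tagged "a" and "b", a bounded queue per input, and a
graph that only makes progress once both queues hold a frame. All names and
the capacity value are hypothetical.

```python
from collections import deque

def drain_pairs(queues):
    """Consume one frame from each queue while both are non-empty."""
    combined = 0
    while queues["a"] and queues["b"]:
        queues["a"].popleft()
        queues["b"].popleft()
        combined += 1
    return combined

def run_single_demuxer(file_order, queue_capacity):
    """Toy model of one demuxer feeding two bounded per-input queues.

    file_order: stream tags in the order their packets sit in the file.
    Returns (frames_combined, deadlocked).
    """
    queues = {"a": deque(), "b": deque()}
    combined = 0
    for tag in file_order:
        combined += drain_pairs(queues)
        if len(queues[tag]) >= queue_capacity:
            # The demuxer blocks on a full queue while the graph waits for
            # the other stream, which only this same demuxer can deliver.
            return combined, True
        queues[tag].append(tag)
    combined += drain_pairs(queues)
    return combined, False

print(run_single_demuxer("abababab", queue_capacity=2))   # (4, False)
print(run_single_demuxer("aaaabbbb", queue_capacity=2))   # (0, True)
```

In this model the poorly interleaved file only completes when the queue
capacity covers the whole run of "a" packets, which is the unbounded buffering
the workaround was meant to avoid.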

Questions:-
        1. There is a scheduling concept in ffmpeg_sched.c, but it is not very
           strict and it works from the timestamps at the muxer, not at the
           demuxer or decoder.
           Obviously, this is needed to utilize multiple threads fully. But is
           it possible to make the scheduler stricter?
           For example, add a constraint that one decoder never runs ahead
           of any other decoder by more than K frames.
        2. If we modify the filter to return TRYAGAIN instead of waiting when
           its output buffer pool is empty, and then flush all available
           outputs at the next activation, will the filtergraph mechanism be
           able to handle such a filter?
           What would the activate function of such a filter look like?
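
To make question 2 concrete, here is a toy Python model of the behaviour we
have in mind. It is not libavfilter code and does not answer whether the real
framework re-activates the filter at the right time. The class, its methods
and the driver loop are assumptions made only for this sketch.

```python
from collections import deque

class NonBlockingScaler:
    """Toy scaler that defers work instead of blocking on an empty pool."""

    def __init__(self, pool_size):
        self.free_frames = pool_size
        self.inputs = deque()   # frames queued on the scaler's input link

    def activate(self):
        """Produce as many outputs as the pool allows and return them."""
        produced = []
        while self.inputs and self.free_frames > 0:
            self.free_frames -= 1
            produced.append(("scaled", self.inputs.popleft()))
        # Whatever is left in self.inputs is the "try again" case: it stays
        # queued and is flushed on a later activation.
        return produced

    def release(self):
        """Called when the overlay is done with one scaled frame."""
        self.free_frames += 1

def run(arrivals, pool_size):
    """arrivals: (tag, frame_number) pairs; returns (outputs, frames_left)."""
    scaler = NonBlockingScaler(pool_size)
    scaled, main, outputs = deque(), deque(), 0
    for tag, number in arrivals:
        if tag == "360p":
            scaler.inputs.append(number)
        else:
            main.append(number)
        # One pass of the single filtergraph thread: keep activating the
        # filters until neither of them makes progress.
        progress = True
        while progress:
            progress = False
            new_frames = scaler.activate()
            if new_frames:
                scaled.extend(new_frames)
                progress = True
            if scaled and main:   # overlay has both inputs
                scaled.popleft()
                main.popleft()
                scaler.release()
                outputs += 1
                progress = True
    return outputs, len(scaler.inputs)

burst = [("360p", i) for i in range(3)] + [("4k", i) for i in range(3)]
print(run(burst, pool_size=2))   # (3, 0)
```

In this model the third 360p frame simply waits on the scaler's input until
the overlay frees a pool frame, so the thread is never blocked and all three
outputs are produced.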

Regards,
Nitin Kamboj

0001-Make-multiple-queue-changes.patch

