[AMD Official Use Only - AMD Internal Distribution Only]

Hi,
This mail describes a technical issue we encountered while implementing a zero-copy, hardware-accelerated transcoding solution using FFmpeg, followed by the workaround we have tried. The pipeline uses hardware acceleration to decode, scale, overlay, and encode video streams. The problem arises specifically after the move to FFmpeg n7.1, where the pipeline stalls due to synchronization issues between the multiple video streams fed into a filtergraph. Below we document the underlying problem, the fix attempted so far, and the open questions we need answered to refine the approach.

Pipeline Workflow:
- Two video sources are decoded into separate 360p and 4K streams.
- The 360p stream is scaled up using a hardware-accelerated scaler.
- Both streams are then combined using a hardware-accelerated overlay.
- The overlay output is encoded into the final stream.

Usecase:-
ffmpeg -hwaccel ... -i 360p60.264 -i 4kp60.264 -filter_complex \
    "[IN1]scale_hw[TMP1];[IN2][TMP1]overlay_hw[OUT]" -map "[OUT]" ...

                                                 FILTERGRAPH
                                        +----------------------------+
+-------------+                         |    +--------+ 480p         |
|Decoder(360p)+----+                +-->+--->| Scaler +-----+        |
+-------------+    |                |   |    +--------+     |        |
                   |    +-------+   |   |                   |        |
                   +--->|Queue  +---+   |                   |        |
                   |    +-------+   |   |                   v        |
+-------------+    |                |   |               +---------+  |     +---------+
|Decoder(4k)  +----+                +-->+-------------->| Overlay +--+---->| Encoder |
+-------------+                         |               +---------+  |     +---------+
                                        +----------------------------+

Issue:-
The above HW-accelerated use case works fine on ffmpeg 6.x but stalls on ffmpeg n7.1 (after migration). With all-SW plugins the use case works fine on n7.1 as well, but using HW-accelerated plugins (for decode, scale, overlay and encode) causes the pipeline to stall.

Limitations:-
All the HW-accelerated plugins have limited output frame pools, allocated at init. HW memory is limited, so these pools must be kept as small as possible.
If a plugin runs out of frames in its frame pool, it waits for someone downstream to consume/free a frame it has already sent out.

Cause:-
As per our detailed analysis and current understanding of the ffmpeg n7.1 application changes (multithreading support), the issue arises because both decoded outputs are fed to the filtergraph through a common thread queue. Since the two decoders are independent and run on separate threads, decoder(360p) can generate many frames before decoder(4K) generates a single one. The overlay plugin needs at least one frame on both inputs to proceed, so many HW frames get buffered on one input of the overlay filter (scale->overlay). We then run out of free HW frames and the pipeline stalls.

Detailed Explanation:-
1. Frame pools involved.
   a. Decoder(360p) out pool (frames consumed/freed by scaler)
   b. Decoder(4k) out pool (frames consumed/freed by overlay, when both inputs are available and an output is ready)
   c. Scaler out pool (frames consumed/freed by overlay, when both inputs are available and an output is ready)
   d. Overlay out pool (frames consumed by encoder)
2. Execution model of the scaler_hw plugin.
   The plugin runs completely on the filtergraph's thread. Assume the out pool has size N. On receiving an input frame, the plugin first tries to allocate a frame from the out pool. If no frame is available, the filtergraph's thread blocks, which means the overlay filter never gets a chance to run and consume a frame, and we have a deadlock. Thus we must ensure the scaler never processes N more frames than the overlay.
3. Filtergraph execution model.
   All the filters in a filtergraph run on the same thread. Every filter link has an infinitely expanding queue to buffer inputs. For multi-input filters, activate is only called when all inputs are available.
4. FFmpeg application scheduler (ffmpeg_sched.c).
   Filtergraphs with multiple inputs have a single queue into which the various decoders write all inputs.
   Even though the filtergraph has a concept of a best input, only the top entry of the queue is fed to the filtergraph, which makes the best input a mere suggestion and not a binding request. This means a faster demuxer/decoder combination can flood one input of the filtergraph, causing the pipeline to stall as we run out of HW frames.

Workarounds tried:-
Modify ffmpeg_sched.c to use multiple queues (one for every filtergraph input) and always feed the best requested input to the filtergraph. (Diff versus n7.1 attached.)

Side-Effect
This works for most cases but fails when a single file with multiple streams is fed to separate decoders and a filter inside the filtergraph then tries to combine those two inputs. If the streams inside the file are not properly interleaved, such a use case causes a circular deadlock with this workaround. Specifically, the FATE test filter-overlay-dvdsub-2397 hangs.

Questions:-
1. There is a scheduling concept in ffmpeg_sched.c, but it is not very strict, and it works by looking at the timestamps of the muxer, not the demuxer or decoder. Obviously, this is needed to utilize multiple threads fully. But is it possible to modify the scheduler to be stricter? For example, add a constraint that one decoder will never run ahead of any other decoder by more than K frames.
2. If we modify the filter to return TRYAGAIN instead of waiting when the output buffer pool is empty, and then flush out all available outputs at the next activation, will the filtergraph mechanism be able to handle such a filter? What would the activate function of such a filter look like?

Regards,
Nitin Kamboj
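
Illustration for the stall and for question 1: the sketch below is a toy model in plain C, not FFmpeg code, and nothing in it exists in ffmpeg_sched.c. It assumes the simplified picture from the mail: every 360p frame ('A') takes one frame from the scaler out pool, every 4K frame ('B') lands on the other overlay input, and the overlay releases a scaled frame as soon as it holds one frame on each input. The names first_stall and may_run are hypothetical, and the 4K decoder's own pool limit is ignored.

```c
#include <stddef.h>

/*
 * Toy model (assumption, not FFmpeg code): walk the order in which frames
 * leave the shared queue. 'A' = frame from decoder(360p), 'B' = frame from
 * decoder(4K). Returns the index in 'order' at which the scaler would block
 * on an empty out pool, or -1 if the whole sequence is processed.
 */
static int first_stall(const char *order, int pool_size)
{
    int scaled_pending = 0; /* scaler outputs buffered on the overlay input */
    int main_pending   = 0; /* 4K frames buffered on the overlay input */

    for (size_t i = 0; order[i]; i++) {
        if (order[i] == 'A') {
            if (scaled_pending == pool_size)
                return (int)i; /* pool empty: the filtergraph thread blocks */
            scaled_pending++;
        } else {
            main_pending++;
        }
        /* the overlay consumes one frame per input once both are present */
        while (scaled_pending > 0 && main_pending > 0) {
            scaled_pending--;
            main_pending--;
        }
    }
    return -1;
}

/*
 * Hypothetical check for the constraint in question 1: decoder 'idx' may
 * emit another frame only while it is fewer than 'k' frames ahead of the
 * slowest decoder. 'produced' holds the per-decoder frame counts.
 */
static int may_run(const int *produced, int n, int idx, int k)
{
    int min = produced[0];

    for (int i = 1; i < n; i++)
        if (produced[i] < min)
            min = produced[i];
    return produced[idx] - min < k;
}
```

In this model, first_stall("AAAAB", 3) returns 3: the fourth 360p frame finds the pool of three scaled frames exhausted because no 4K frame has arrived yet. Gating each decoder with may_run and a k no larger than the pool size keeps such a run out of the feed order, at least within the model; whether the real scheduler can enforce this without losing parallelism is exactly what question 1 asks.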
0001-Make-multiple-queue-changes.patch
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".