On Mon, Jan 25, 2016 at 7:50 PM, Michael Niedermayer <mich...@niedermayer.cc> wrote:
> On Mon, Jan 25, 2016 at 04:39:49PM +0100, Hendrik Leppkes wrote:
>> On Mon, Jan 25, 2016 at 1:28 PM, Ronald S. Bultje <rsbul...@gmail.com> wrote:
>> > Hi,
>> >
>> > On Mon, Jan 25, 2016 at 4:01 AM, wm4 <nfx...@googlemail.com> wrote:
>> >
>> >> On Sun, 24 Jan 2016 20:03:01 -0500
>> >> "Ronald S. Bultje" <rsbul...@gmail.com> wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > On Sun, Jan 24, 2016 at 6:49 PM, Hendrik Leppkes <h.lepp...@gmail.com>
>> >> > wrote:
>> >> >
>> >> > > Unfortunately that doesn't alleviate the other issues, like the
>> >> > > complexity needed in the decoders during frame threading, or the extra
>> >> > > resources needed (extra image surfaces for every thread).
>> >> > >
>> >> >
>> >> > So, the extra code is just in the decoders, which already need it anyway
>> >> > (because they implement frame-mt), right? Or do hwaccels need extra code
>> >> > also?
>> >> >
>> >> > The extra resources aren't a big deal IMO. Memory use isn't typically a
>> >> > big issue, we're adding a few kB extra for contexts but practically all
>> >> > memory is in framebuffers regardless.
>> >>
>> >> It can be a big deal for hardware decoding, because hw surfaces
>> >> might be a more constrained resource than system RAM. Also, you often
>> >> have to preallocate _all_ surfaces you're going to use, so you'll have
>> >> to add the exact number of additionally needed surfaces to the
>> >> preallocation.
>> >
>> >
>> > If only one thread is active, the rest never has to be inited and thus
>> > contains no surfaces (or framebuffers, or anything), right? If not, that
>> > should be a trivial win.
>> >
>>
>> If you can implement it like this, i.e. only make one single thread do
>> the work, that would also avoid a bunch of the complexity with copying
>> contexts around and avoiding multiple init calls of the hwaccel.
>> On top of that, avoid the extra resource requirements and the delay
>> inherent to frame threading otherwise, since no extra frames are
>> "cached" inside the other worker threads.
>
> is there no hwaccel that (can) work(s) with MT ?
> I am bringing that up here before code is unconditionally removed that
> might be needed for such case
>

Like I explained in an earlier post above, hwaccels don't multi-thread:
they execute asynchronously on a worker thread, but never more than one
at the same time.

The only reason someone might potentially see any speedup from using
hwaccel+MT is the sheer lack of optimizations in
ffmpeg_<dxva2/vdpau>.c. Some very basic pipelining would give the same
speedup, instead of forcing the hardware to sync every frame
immediately, as it does right now.

That's really all the MT case does today: it adds a "delay" that allows
the hardware to work more in parallel internally. You don't need MT to
do that, though; you can just buffer 2-4 output frames before trying to
process them and achieve the exact same speedup.

This behavior was confirmed by an NVIDIA engineer some years ago: the
hardware has several "stages", and for optimal performance you should
keep multiple frames inside the hardware. The decode APIs don't allow
this, so the GPU already returns you a frame while it's still being
decoded in a later stage, and once you try to access it, the GPU has to
"sync" the frame and wait until it's done.
If you just buffer it for a bit (say a 2 frame ring buffer), this
bottleneck goes away and all is fast.

- Hendrik
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
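
For illustration, the "2 frame ring buffer" idea from the post could be
sketched roughly as below. This is only a minimal sketch under stated
assumptions, not code from ffmpeg_dxva2.c or ffmpeg_vdpau.c: HwFrame,
DelayQueue, delay_queue_push, delay_queue_flush and the DELAY_FRAMES
depth are all hypothetical names made up for the example. The caller
queues each surface as the decoder returns it and only touches the
surface that comes back out, which was returned DELAY_FRAMES calls
earlier and so has had time to finish in the hardware.

```c
#include <stddef.h>

#define DELAY_FRAMES 2  /* assumed depth; the post suggests 2-4 frames */

/* Hypothetical stand-in for a decoded hardware surface. */
typedef struct HwFrame {
    int id;
} HwFrame;

/* Fixed-size ring buffer that delays frames by DELAY_FRAMES pushes. */
typedef struct DelayQueue {
    HwFrame *slots[DELAY_FRAMES];
    size_t pos;    /* slot that will be overwritten by the next push */
    size_t count;  /* number of frames currently held */
} DelayQueue;

/*
 * Queue the newest frame. Returns the frame queued DELAY_FRAMES pushes
 * ago, which is the one the caller may now read back, or NULL while
 * the queue is still filling up.
 */
static HwFrame *delay_queue_push(DelayQueue *q, HwFrame *newest)
{
    HwFrame *oldest = NULL;

    if (q->count == DELAY_FRAMES)
        oldest = q->slots[q->pos];  /* full: the slot at pos is the oldest */
    else
        q->count++;

    q->slots[q->pos] = newest;
    q->pos = (q->pos + 1) % DELAY_FRAMES;
    return oldest;
}

/*
 * Drain at end of stream. Returns the remaining frames oldest first,
 * one per call, and NULL once the queue is empty.
 */
static HwFrame *delay_queue_flush(DelayQueue *q)
{
    size_t idx;

    if (q->count == 0)
        return NULL;

    /* The oldest held frame sits count slots behind pos. */
    idx = (q->pos + DELAY_FRAMES - q->count) % DELAY_FRAMES;
    q->count--;
    return q->slots[idx];
}
```

The cost is DELAY_FRAMES frames of added output latency and that many
extra surfaces held at once, which would have to be counted when
preallocating surfaces.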