On 2020-08-08 8:24, Soft Works wrote:


-----Original Message-----
From: ffmpeg-devel <ffmpeg-devel-boun...@ffmpeg.org> On Behalf Of
Steve Lhomme
Sent: Saturday, August 8, 2020 7:10 AM
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
copies are done before submitting them

[...]


Hi Steve,

Hi,

A while ago I extended the D3D11VA implementation to support single
(non-array) textures for interoperability with Intel QSV + DX11.

Looking at your code, it seems you are copying from an array texture to a
single-slice texture to achieve this, with double the amount of RAM. It may
be a design issue with the new D3D11 API, which forces you to do that, but
I'm not using that API. I'm using the old API.

With D3D11, it's mandatory to use a staging texture; this is done not only
in my code but also in the original implementation (hwcontext_d3d11va.c):
https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_d3d11va.c

I'm not sure whether I understand what you mean by this. To my knowledge
there are two DX hw context implementations in ffmpeg:

- DXVA2
- D3D11VA

I'm not aware of a variant like "D3D11 with old API". Could you please 
elaborate?

There is AV_PIX_FMT_D3D11VA_VLD (old) and AV_PIX_FMT_D3D11 (new).
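
For illustration, here is a minimal sketch (mine, not code from either
patch) of how the two formats carry their D3D11 payloads in an AVFrame,
per the descriptions in libavutil/pixfmt.h:

#include <stdint.h>
#include <d3d11.h>
#include <libavutil/frame.h>
#include <libavutil/pixfmt.h>

static void inspect_d3d11_frame(const AVFrame *frame)
{
    if (frame->format == AV_PIX_FMT_D3D11VA_VLD) {
        /* old API: data[3] is the ID3D11VideoDecoderOutputView
         * the frame was decoded into */
        ID3D11VideoDecoderOutputView *view =
            (ID3D11VideoDecoderOutputView *)frame->data[3];
        (void)view;
    } else if (frame->format == AV_PIX_FMT_D3D11) {
        /* new API: data[0] is the ID3D11Texture2D (typically a
         * texture array), data[1] the slice index into that array */
        ID3D11Texture2D *tex = (ID3D11Texture2D *)frame->data[0];
        intptr_t slice = (intptr_t)frame->data[1];
        (void)tex; (void)slice;
    }
}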


Hence, I don't think that your patch is the best possible way.

Removing locks and saying "it works for me" is not a correct solution either.

How did you come to the conclusion that I might be working like this?

The commented out "hwctx->lock" lines in your code.

At the very least, the locks are needed inside libavcodec to avoid setting
DXVA buffers concurrently from different threads. Otherwise you will most
likely get severe distortions, if not crashes. Maybe you're only using one
decoding thread with DXVA (which a lot of people do), so you don't see this
issue, but that is not my case.

I see no point in employing multiple threads for hw-accelerated decoding.
To be honest, I never looked into or tried whether ffmpeg even supports
multiple threads with dxva2 or d3d11va hw acceleration.

Maybe you're in an ideal situation where all the files you play through libavcodec are hardware accelerated (i.e. always with matching hardware). In that case you don't need to care about falling back to software decoding, where using a single thread would give terrible performance.

Even then, there's still a chance that using multiple threads improves performance. All the code that prepares the buffers fed into the hardware decoder can run in parallel for multiple frames. If you have an insanely fast hardware decoder, that preparation becomes the bottleneck; in a transcoding scenario that could have an impact.

Also, ID3D10Multithread::SetMultithreadProtected means that the resources
can be accessed from multiple threads. It doesn't mean that calls to
ID3D11DeviceContext are safe across threads, and my experience shows
that they are not. In fact, if you have the Windows SDK installed and you
make concurrent accesses, you'll get a big warning in your debug logs that
you are doing something fishy. On Windows Phone it would even crash. This is
how I ended up adding the mutex to the old API
(e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86).

The documentation for ID3D11DeviceContext is very clear about that [1]:
"Because each ID3D11DeviceContext is single threaded, only one thread can
call a ID3D11DeviceContext at a time. If multiple threads must access a single
ID3D11DeviceContext, they must use some synchronization mechanism,
such as critical sections, to synchronize access to that ID3D11DeviceContext."

Yes, but this doesn't apply to accessing staging textures IIRC.

It does. To copy to a staging texture you need to use ID3D11DeviceContext::CopySubresourceRegion().
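
For reference, a rough sketch (my own; the helper name is made up, not code
from either patch) of such a staging copy, serialized with the lock/unlock
callbacks that AVD3D11VADeviceContext exposes in
libavutil/hwcontext_d3d11va.h:

#include <d3d11.h>
#include <libavutil/hwcontext.h>
#include <libavutil/hwcontext_d3d11va.h>

/* hypothetical helper, not code from either patch */
static void copy_to_staging_locked(AVHWDeviceContext *dev_ctx,
                                   ID3D11Resource *staging,
                                   ID3D11Resource *src, UINT src_sub)
{
    AVD3D11VADeviceContext *hwctx = dev_ctx->hwctx;

    hwctx->lock(hwctx->lock_ctx);   /* one thread in the context at a time */
    ID3D11DeviceContext_CopySubresourceRegion(hwctx->device_context,
                                              staging, 0, 0, 0, 0,
                                              src, src_sub, NULL);
    hwctx->unlock(hwctx->lock_ctx);
}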

You probably don't have any synchronization issues in your pipeline because it seems you copy from the GPU to the CPU. In that case the driver internally forces the equivalent of an ID3D11DeviceContext::GetData() wait, to make sure all the commands that produce your source texture on that device context have finished processing. You may not see it, but there's a wait happening there. In my case there's nothing happening between the decoder and the rendering of the texture.
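
To make that hidden wait concrete: mapping the staging texture for reading
blocks until the GPU work that fills it has completed. A small sketch (again
hypothetical, not either patch):

#include <d3d11.h>

/* hypothetical helper: Map() with D3D11_MAP_READ blocks until the
 * GPU commands that fill the staging texture have finished */
static HRESULT read_staging(ID3D11DeviceContext *ctx,
                            ID3D11Resource *staging)
{
    D3D11_MAPPED_SUBRESOURCE map;
    HRESULT hr = ID3D11DeviceContext_Map(ctx, staging, 0,
                                         D3D11_MAP_READ, 0, &map);
    if (FAILED(hr))
        return hr;
    /* map.pData / map.RowPitch now point at CPU-visible pixels */
    ID3D11DeviceContext_Unmap(ctx, staging, 0);
    return S_OK;
}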

In fact, I had researched this in depth, but I can't tell you much more
without looking into it again.

The patch I referenced is working in production on thousands of installations
and has been tested with many different hardware and driver versions from
Nvidia, Intel, and AMD.

And I added the lock before because, as the specs say, it's necessary. That solved some issues on the hundreds of millions of VLC installations running on Windows, on all the hardware you can think of.

Decoding 8K 60 fps HEVC was also a good stress test of the code. The ID3D11DeviceContext::GetData() on the rendering side ensured that frames were displayed at the right time and not whenever the pipeline was done processing the device context commands.
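
As a rough sketch of that idea (mine, not the actual VLC or patch code):
issue a D3D11_QUERY_EVENT after the submitted commands and poll GetData()
until the GPU has drained them:

#include <d3d11.h>

/* hypothetical helper: returns once all previously submitted
 * commands on `ctx` have finished executing on the GPU */
static HRESULT wait_gpu_done(ID3D11Device *dev, ID3D11DeviceContext *ctx)
{
    D3D11_QUERY_DESC desc = { D3D11_QUERY_EVENT, 0 };
    ID3D11Query *query;
    HRESULT hr = ID3D11Device_CreateQuery(dev, &desc, &query);
    if (FAILED(hr))
        return hr;

    ID3D11DeviceContext_End(ctx, (ID3D11Asynchronous *)query);
    ID3D11DeviceContext_Flush(ctx);

    /* GetData() returns S_FALSE while the GPU is still busy */
    while (ID3D11DeviceContext_GetData(ctx, (ID3D11Asynchronous *)query,
                                       NULL, 0, 0) == S_FALSE)
        ;   /* busy-wait; real code would yield or sleep */

    ID3D11Query_Release(query);
    return S_OK;
}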

Now I realize the same thing should be done on the decoder side, with the improvements given earlier in this thread.