On 2020-08-07 23:59, Soft Works wrote:
-----Original Message-----
From: ffmpeg-devel <ffmpeg-devel-boun...@ffmpeg.org> On Behalf Of
Steve Lhomme
Sent: Friday, August 7, 2020 3:05 PM
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
copies are done before submitting them

I experimented a bit more with this. Here are the 3 scenarios, in order from fewest late frames:

- GetData waiting for 1/2s and releasing the lock
- No use of GetData (current code)
- GetData waiting for 1/2s and keeping the lock

The last option has horrible performance issues and should not be used.

The first option gives about 50% fewer late frames compared to the current
code. *But* it requires unlocking the Video Context. There are 2 problems
with this:

- the same ID3D11Asynchronous is used to wait on multiple concurrent
threads. This can confuse D3D11, which emits a warning in the logs.
- another thread might Get/Release some buffers and submit them before
this thread has finished processing. That can result in distortions, for
example if the second thread/frame depends on the first thread/frame,
which has not been submitted yet.

The former issue can be solved by using one ID3D11Asynchronous per thread.
That requires some TLS storage, which FFmpeg doesn't seem to support yet.
With this I get virtually no late frames.

The latter issue only occurs if the wait is too long. For example, waiting
in increments of 10 ms is too long in my tests. Using increments of 1 ms or
2 ms works fine with the most stressful sample I have (Sony Camping HDR HEVC,
high bitrate). But this seems hackish. There's still potentially a quick frame
(an alt frame in VPx/AV1, for example) that might get through to the decoder
too early. (I suppose that's the source of the distortions I see.)
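The wait-and-release pattern described above could look roughly like this with the C COM bindings. This is a sketch, not the actual patch: the `wait_copies_done` name is illustrative, `async` is assumed to be a per-thread D3D11_QUERY_EVENT, and `mutex` the shared device mutex:

```c
#include <windows.h>
#include <d3d11.h>

/* Sketch only: wait for the decoder buffer copies to reach the GPU,
 * releasing the device lock between short polls so the other decoding
 * threads can make progress. */
static void wait_copies_done(ID3D11DeviceContext *ctx,
                             ID3D11Asynchronous *async,
                             HANDLE mutex)
{
    HRESULT hr;

    WaitForSingleObject(mutex, INFINITE);
    ID3D11DeviceContext_End(ctx, async);   /* mark the point to wait on */
    for (;;) {
        hr = ID3D11DeviceContext_GetData(ctx, async, NULL, 0, 0);
        if (hr != S_FALSE)                 /* S_OK: done, or an error */
            break;
        /* Not finished: drop the lock and wait a short increment.
         * 1-2 ms worked in testing; 10 ms was too long. */
        ReleaseMutex(mutex);
        Sleep(1);
        WaitForSingleObject(mutex, INFINITE);
    }
    ReleaseMutex(mutex);
}
```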

It's also possible to change the order in which the buffers are sent, by
starting with the biggest one (D3D11_VIDEO_DECODER_BUFFER_BITSTREAM). But it
seems to have little influence, regardless of whether we wait for buffer
submission or not.

The results are consistent between integrated GPU and dedicated GPU.

Hi Steven,

Hi,

A while ago I had extended the D3D11VA implementation to support single
(non-array) textures for interoperability with Intel QSV + DX11.

Looking at your code, it seems you are copying from an array texture to a single-slice texture to achieve this, doubling the amount of RAM. It may be a design issue with the new D3D11 API that forces you to do that, but I'm not using that API; I'm using the old one.

In my case I directly render the texture slices coming out of the decoder, with no copying (and no extra memory allocation). It happens in a different thread than the decoder thread(s).

In VLC we also support direct D3D11-to-QSV encoding. It does require a copy to "shadow" textures to feed QSV; I never managed to make it work without a copy.

I noticed a few bottlenecks making D3D11VA significantly slower than DXVA2.

The solution was to use ID3D10Multithread_SetMultithreadProtected and
remove all the locks which are currently applied.

I am also using that.
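For reference, enabling that protection is a couple of calls. A sketch with the C COM bindings follows; the `enable_mt_protection` name is illustrative, and the exact header providing ID3D10Multithread may vary with the SDK:

```c
#include <windows.h>
#include <d3d11.h>
#include <d3d10.h>   /* ID3D10Multithread */

/* Sketch: enable multithread protection on the device context.  This
 * makes resource access safe across threads, but it does NOT make
 * concurrent ID3D11DeviceContext calls legal (see below). */
static void enable_mt_protection(ID3D11DeviceContext *ctx)
{
    ID3D10Multithread *mt;
    if (SUCCEEDED(ID3D11DeviceContext_QueryInterface(ctx,
                      &IID_ID3D10Multithread, (void **)&mt))) {
        ID3D10Multithread_SetMultithreadProtected(mt, TRUE);
        ID3D10Multithread_Release(mt);
    }
}
```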

Hence, I don't think that your patch is the best possible way.

Removing locks and saying "it works for me" is not a correct solution either. At the very least, the locks are needed inside libavcodec to avoid setting DXVA buffers concurrently from different threads; doing so would most likely result in very bad distortions, if not crashes. Maybe you're only using one decoding thread with DXVA (which a lot of people do), so you don't see this issue, but that is not my case.

Also, ID3D10Multithread::SetMultithreadProtected means that the resources can be accessed from multiple threads. It doesn't mean that calls to ID3D11DeviceContext are thread-safe, and my experience shows that they are not. In fact, if you have the Windows SDK installed and you make concurrent accesses, you'll get a big warning in your debug logs that you are doing something fishy. On Windows Phone it would even crash. This is how I ended up adding the mutex to the old API (e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86).

The documentation for ID3D11DeviceContext is very clear about that [1]:
"Because each ID3D11DeviceContext is single threaded, only one thread can call a ID3D11DeviceContext at a time. If multiple threads must access a single ID3D11DeviceContext, they must use some synchronization mechanism, such as critical sections, to synchronize access to that ID3D11DeviceContext."
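In practice, the quoted requirement amounts to wrapping every device-context call in one lock. A sketch with a Win32 critical section; the `submit_locked` helper and the global lock are illustrative, not real FFmpeg/VLC symbols:

```c
#include <windows.h>
#include <d3d11.h>

static CRITICAL_SECTION ctx_lock;   /* one lock per device context */

/* Sketch: serialize the ID3D11VideoContext call behind the lock, so
 * that only one thread touches the (single-threaded) context at a time. */
static HRESULT submit_locked(ID3D11VideoContext *vctx,
                             ID3D11VideoDecoder *decoder,
                             UINT nb_buffers,
                             const D3D11_VIDEO_DECODER_BUFFER_DESC *desc)
{
    HRESULT hr;
    EnterCriticalSection(&ctx_lock);
    hr = ID3D11VideoContext_SubmitDecoderBuffers(vctx, decoder,
                                                 nb_buffers, desc);
    LeaveCriticalSection(&ctx_lock);
    return hr;
}
```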

The DXVA documentation is much less clear on the subject. But given that ID3D11VideoContext is bound to the ID3D11DeviceContext (even though it is not an ID3D11DeviceContext itself), it seems reasonable to assume it has the same restrictions.

[1] https://docs.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-render-multi-thread-intro
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
