On 24/10/18 21:45, Rogozhkin, Dmitry V wrote:
> On Tue, 2018-10-23 at 23:46 +0100, Mark Thompson wrote:
>> On 23/10/18 08:49, Li, Zhong wrote:
>>>> From: ffmpeg-devel [mailto:ffmpeg-devel-boun...@ffmpeg.org] On Behalf Of Mark Thompson
>>>> Sent: Sunday, October 14, 2018 12:36 AM
>>>> To: ffmpeg-devel@ffmpeg.org
>>>> Subject: Re: [FFmpeg-devel] [FFmpeg-cvslog] lavu/qsv: make a copy as libmfx alignment requirement for uploading
>>>>
>>>> On 11/10/18 06:38, Zhong Li wrote:
>>>>> ffmpeg | branch: master | Zhong Li <zhong...@intel.com> | Mon Sep 17 19:16:44 2018 +0800| [681aa7d14f97fd98181ca6d61e11be48fe65692d] | committer: Zhong Li
>>>>>
>>>>> lavu/qsv: make a copy as libmfx alignment requirement for uploading
>>>>>
>>>>> Libmfx requires 16-byte-aligned input/output for uploading. Currently
>>>>> only the output is 16-byte aligned; the input is assigned the same
>>>>> width/height but actually has a smaller buffer, so this will
>>>>> definitely cause a segmentation fault.
>>>>>
>>>>> Can reproduce with any 1080p nv12 rawvideo input:
>>>>> ffmpeg -init_hw_device qsv=qsv:hw -hwaccel qsv -filter_hw_device qsv \
>>>>>     -f rawvideo -pix_fmt nv12 -s:v 1920x1080 -i 1080p_nv12.yuv \
>>>>>     -vf 'format=nv12,hwupload=extra_hw_frames=16,hwdownload,format=nv12' \
>>>>>     -an -y out_nv12.yuv
>>>>>
>>>>> It can fix #7418
>>>>>
>>>>> Signed-off-by: Zhong Li <zhong...@intel.com>
>>>>>
>>>>>> http://git.videolan.org/gitweb.cgi/ffmpeg.git/?a=commit;h=681aa7d14f97fd98181ca6d61e11be48fe65692d
>>>>>
>>>>> ---
>>>>>
>>>>>  libavutil/hwcontext_qsv.c | 31 +++++++++++++++++++++++++++++--
>>>>>  1 file changed, 29 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/libavutil/hwcontext_qsv.c b/libavutil/hwcontext_qsv.c
>>>>> index 33e121b416..814ce215ce 100644
>>>>> --- a/libavutil/hwcontext_qsv.c
>>>>> +++ b/libavutil/hwcontext_qsv.c
>>>>> @@ -862,6 +862,10 @@ static int qsv_transfer_data_to(AVHWFramesContext *ctx, AVFrame *dst,
>>>>>      mfxSyncPoint sync = NULL;
>>>>>      mfxStatus err;
>>>>>      int ret = 0;
>>>>> +    /* make a copy if the input is not padded as libmfx requires */
>>>>> +    AVFrame tmp_frame, *src_frame;
>>>>> +    int realigned = 0;
>>>>> +
>>>>>
>>>>>      while (!s->session_upload_init && !s->session_upload && !ret) {
>>>>> #if HAVE_PTHREADS
>>>>> @@ -887,16 +891,36 @@ static int qsv_transfer_data_to(AVHWFramesContext *ctx, AVFrame *dst,
>>>>>      if (ret < 0)
>>>>>          return ret;
>>>>>
>>>>> +
>>>>> +    if (src->height & 16 || src->linesize[0] & 16) {
>>>>
>>>> Doesn't this still need to check that the layout is compatible with the
>>>> other limitations that libmfx has before trying to do the upload?  In
>>>> particular, that the pitches for the different planes are actually all
>>>> the same - I certainly wouldn't expect them to be for YUV420P.
>>>
>>> Do you mean our previous discussion at
>>> https://lists.libav.org/pipermail/libav-devel/2018-April/086070.html ?
>>> If yes, I will try to reproduce it with the hwupload pipeline instead
>>> of the transcoding pipeline, and then send an updated patch.
>>
>> Yes.  You might need to write a small test program to generate an
>> arbitrary allocation pattern, since libavutil only has a fixed layout.
>>
>>>>
>>>> (IIRC there was a requirement 0 < frame->data[N] - frame->data[0] < 2^24
>>>> as well (or something like that, I might be wrong), which also need not
>>>> be true.)
>>>
>>> I don't know of such a requirement; can it be found in the MSDK
>>> guide?
>>
>> That came from discussion of hardware capabilities around VAAPI, I
>> think.
> 
> There are upper limits on the frame sizes which HW can support. And this
> limit can differ between operations: for example, the AVC encoder
> supports 4K as a maximum, while JPEG can support 16K. This depends on the
> component and may depend on the generation of hardware. In any case, I
> don't think any of these kinds of limitations are something to be checked
> within ffmpeg during allocation. It's the responsibility of the
> underlying implementation through which you try to allocate to return an
> error to ffmpeg that the allocation cannot be done, or the responsibility
> of the component (msdk or vaapi) to return an error that processing of
> the given surface cannot be done (due to per-component limitations).

I meant real hardware capabilities around layout, which are going to be a 
superset of anything which can actually be reported in any particular API (see 
below).

>>
>> Won't something like this be the reason for wanting a virtually
>> contiguous buffer?  If there isn't a restriction along these lines
>> then there is no need to make the buffer virtually
>> contiguous.  (After all, the hardware can't know about C undefined
>> behaviour in pointer comparisons...)
> 
> "Virtually contiguous" sounds dangerous to me. There is a discrepancy in
> how layouts of color formats are treated between ffmpeg and other
> libraries (and drivers). First of all, I did not see a clear document
> describing what ffmpeg thinks the layout should be. It seems to me that
> ffmpeg treats color formats per-plane, meaning that for NV12 the Y and UV
> planes could be allocated completely separately.

Correct.  FFmpeg deliberately places little constraint on the real layout of an 
NV12 frame - the planes need not be in any particular order or place in memory, 
nor need the pitch have any particular value relative to the width.  The one 
constraint we do have is that each line start at an address which is a multiple 
of some defined platform-specific alignment (and therefore plane pitches are 
multiples of such a value as well).
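
For concreteness, a quick sketch (not code from the tree; the fixed size and
missing error checking are just for brevity) of what a default allocation
actually promises - per-line alignment and nothing relating the plane
pointers or pitches to the width or to each other:

    #include <stdio.h>
    #include <libavutil/cpu.h>
    #include <libavutil/frame.h>

    int main(void)
    {
        AVFrame *f = av_frame_alloc();
        f->format = AV_PIX_FMT_NV12;
        f->width  = 1920;
        f->height = 1080;
        av_frame_get_buffer(f, 0);  /* 0 = platform default alignment */

        printf("platform alignment: %zu\n", av_cpu_max_align());
        printf("Y  plane %p, pitch %d\n", (void *)f->data[0], f->linesize[0]);
        printf("UV plane %p, pitch %d, %td bytes after Y\n",
               (void *)f->data[1], f->linesize[1], f->data[1] - f->data[0]);

        av_frame_free(&f);
        return 0;
    }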

>                                  While you work with SW components this
> might be ok, but HW defines pretty specifically what it expects, and
> typically the memory layout is supposed to be contiguous - I mean
> _physically_ contiguous.

While this is true in a lot of implementations, it isn't in the ones being 
considered here.  Modern Intel GPUs, including their video hardware, have an 
MMU with the same level of page access as the CPU does, so anything which is 
virtually contiguous from the CPU point of view certainly can be from the GPU 
as well.  Objects which are physically contiguous might be desirable in some 
marginal cases for performance (desirable ordering properties for RAM chips?), 
but certainly aren't necessary.

>                          Secondly, HW may impose other constraints on
> the layout requiring particular alignments. These could be width
> alignments, i.e. pitches, or height padding alignments. As of now it
> seems that ffmpeg doesn't provide an architecture to satisfy all these
> needs. This leads to overhead in terms of memory consumption, since
> additional "rightly allocated" memory is required to copy frames into,
> and eventually this implies processing overhead as well.

These real constraints are mostly unsatisfiable for reasonable use-cases of 
Intel GPUs.  For example, H.264 encoding on Skylake requires Y-tiled NV12 
surfaces with a number of additional constraints such as the chroma plane 
having the same pitch as the luma plane and being located at a line offset of 
at most 2^15-1.  This particular constraint is the one I was thinking of above: 
given a pitch of (say) 2048 = 2^11, the chroma plane can't be offset from the 
luma plane by more than 2^26, and in the positive direction only - that pretty 
much means you need to allocate them virtually-contiguously, since planes 
allocated normally have no reason to satisfy that (especially with ASLR 
enabled).  (This comes from 
<https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-vol02a-commandreference-instructions.pdf>
 - MFX_SURFACE_STATE beginning page 818 is relevant.)
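
To make the arithmetic concrete, a rough sketch of that check (my reading of
the PRM text above, nothing taken from driver source; the function name is
made up):

    #include <stdbool.h>
    #include <stddef.h>
    #include <libavutil/frame.h>

    /* NV12 assumed: data[0] is luma, data[1] is chroma.  Note that if the
     * planes really were separate allocations, the pointer subtraction
     * below would itself be undefined behaviour in C - another reason the
     * two planes end up in one virtually contiguous buffer. */
    static bool nv12_fits_skl_mfx(const AVFrame *f)
    {
        ptrdiff_t pitch  = f->linesize[0];
        ptrdiff_t offset = f->data[1] - f->data[0];

        if (f->linesize[1] != pitch)        /* chroma pitch must equal luma pitch */
            return false;
        if (offset <= 0 || offset % pitch)  /* chroma must start on a line boundary after luma */
            return false;
        return offset / pitch <= 0x7fff;    /* line offset must fit in 15 bits */
    }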

The Y-tiling in particular makes it very difficult for any upload case - there 
is no reasonable software implementation of the address swizzle needed to write 
Y-tiled planes.  The one possible getout here is to use the MMU to get a linear 
map of the Y-tiled surface, but in general this causes problems because such a 
mapping is uncached and therefore has very low performance for reads.  (You can 
still use it in FFmpeg with VAAPI if you want - the av_hwframe_map() function 
and the hwmap filter allow it, though it's only really usable for write-only 
surfaces.)
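
For the write-only path that looks roughly like this (a sketch, with error
handling trimmed and the surrounding VAAPI frames-context setup assumed):

    #include <libavutil/frame.h>
    #include <libavutil/hwcontext.h>

    /* hw_frame is assumed to be backed by an AV_PIX_FMT_VAAPI frames context. */
    static int fill_surface(AVFrame *hw_frame)
    {
        AVFrame *mapped = av_frame_alloc();
        int ret;
        if (!mapped)
            return AVERROR(ENOMEM);

        /* WRITE + OVERWRITE: we promise not to read, so the uncached linear
         * view of the tiled surface is acceptable. */
        ret = av_hwframe_map(mapped, hw_frame,
                             AV_HWFRAME_MAP_WRITE | AV_HWFRAME_MAP_OVERWRITE);
        if (ret >= 0) {
            /* ... write into mapped->data[] / mapped->linesize[] only ... */
            ret = 0;
        }

        av_frame_free(&mapped);  /* dropping the last reference unmaps the surface */
        return ret;
    }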

Given that, we have to make a copy somewhere in almost all cases.  I believe 
the change made here for libmfx is done so that the GPU can perform the copy 
itself, lifting the input data directly from CPU memory and therefore saving on 
the CPU time which would be used to perform the same copy.  I can't find this 
particular case in the relevant source code, but I assume it will have some 
similar constraints to the above video case - in particular, the chroma plane 
location relative to the luma plane is probably constrained in the same way (I 
admit I've extrapolated that from the virtual-contiguity requirement, but I 
don't see where else it would come from).
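
The CPU-side fallback version of that copy - essentially what the patch
quoted above does - would look something like this sketch (simplified, not
the committed code; the helper name and the fixed 16 alignment are just for
illustration):

    #include <libavutil/frame.h>
    #include <libavutil/macros.h>

    static int realign_for_upload(AVFrame **out, const AVFrame *src)
    {
        AVFrame *tmp = av_frame_alloc();
        int ret;
        if (!tmp)
            return AVERROR(ENOMEM);

        tmp->format = src->format;
        tmp->width  = FFALIGN(src->width,  16);
        tmp->height = FFALIGN(src->height, 16);

        ret = av_frame_get_buffer(tmp, 0);   /* padded, aligned allocation */
        if (ret >= 0) {
            /* copy only the active area; the padding stays untouched */
            tmp->width  = src->width;
            tmp->height = src->height;
            ret = av_frame_copy(tmp, src);
        }

        if (ret < 0) {
            av_frame_free(&tmp);
            return ret;
        }
        *out = tmp;
        return 0;
    }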

> Here is an example from gstreamer where they describe what they mean by
> color formats:
> https://gstreamer.freedesktop.org/documentation/design/mediatype-video-raw.html
> It would be wonderful if ffmpeg could define something similar. Note
> that in gstreamer you get a clear idea of what the "default" NV12 memory
> layout is: it is contiguous memory where the UV plane is located right
> after the end of the Y plane, with no additional padding for width or
> height.

It's unclear what you are gaining from this being defined as the "default".  If 
that's defined in the API then it constrains the allocator to only return 
frames of that form (so you can't do things like optimising pitches to avoid 
cache way collisions), but offers little benefit to the user unless they happen 
to have input in exactly that form somehow (say because they read a raw file 
with identical layout).  (And it's irrelevant for any component, since they 
have to deal with the general case anyway.)
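
For what it's worth, that tight layout is already expressible with the
existing libavutil helpers - av_image_fill_arrays() with align=1 gives
contiguous NV12 with the UV plane starting exactly w*h bytes after Y - it
just isn't given any privileged status.  A small sketch:

    #include <stdio.h>
    #include <libavutil/imgutils.h>
    #include <libavutil/mem.h>

    int main(void)
    {
        uint8_t *data[4];
        int      linesize[4];
        int      w = 1920, h = 1080;

        int size = av_image_get_buffer_size(AV_PIX_FMT_NV12, w, h, 1);
        uint8_t *buf = av_malloc(size);

        av_image_fill_arrays(data, linesize, buf, AV_PIX_FMT_NV12, w, h, 1);

        /* Prints: size 3110400 (w*h*3/2), Y pitch 1920, UV offset 2073600 (w*h). */
        printf("size %d, Y pitch %d, UV offset %td\n",
               size, linesize[0], data[1] - data[0]);

        av_free(buf);
        return 0;
    }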

> I have tried to find something similar for ffmpeg, but failed (except
> the ffmpeg source code of course).

The pixdesc descriptors (libavutil/pixdesc.[ch]) contain all the relevant 
information about pixel formats in a programmatic form, so some information 
about that could probably be auto-generated.  (Relatedly, that gets used to 
test which formats are representable on an OpenCL device with a given set of 
capabilities (e.g. CL_R and CL_RG in CL_UNORM_INT8 -> NV12 support).  
Unfortunately very few APIs present this information in a machine-readable 
form, so generally the formats are hard-coded from fourccs or something like 
that.)
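
A tiny sketch of reading that descriptor for NV12 (nothing new here, just
the existing API):

    #include <stdio.h>
    #include <libavutil/pixdesc.h>

    int main(void)
    {
        const AVPixFmtDescriptor *d = av_pix_fmt_desc_get(AV_PIX_FMT_NV12);

        printf("%s: %d planes, chroma subsampling %dx%d\n",
               d->name, av_pix_fmt_count_planes(AV_PIX_FMT_NV12),
               1 << d->log2_chroma_w, 1 << d->log2_chroma_h);

        for (int i = 0; i < d->nb_components; i++)
            printf("component %d: plane %d, step %d, offset %d, depth %d\n",
                   i, d->comp[i].plane, d->comp[i].step,
                   d->comp[i].offset, d->comp[i].depth);

        /* NV12 comes out as Y alone in plane 0 and Cb/Cr interleaved with
         * step 2 in plane 1 - i.e. exactly the CL_R + CL_RG mapping above. */
        return 0;
    }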

Thanks,

- Mark