> From: ffmpeg-devel [mailto:ffmpeg-devel-boun...@ffmpeg.org] On Behalf Of Rogozhkin, Dmitry V
> Sent: Friday, October 26, 2018 8:37 AM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [FFmpeg-cvslog] lavu/qsv: make a copy as libmfx alignment requirement for uploading
>
> On Wed, 2018-10-24 at 23:47 +0100, Mark Thompson wrote:
> > On 24/10/18 21:45, Rogozhkin, Dmitry V wrote:
> > > On Tue, 2018-10-23 at 23:46 +0100, Mark Thompson wrote:
> > > > On 23/10/18 08:49, Li, Zhong wrote:
> > > > > > From: ffmpeg-devel [mailto:ffmpeg-devel-boun...@ffmpeg.org] On Behalf Of Mark Thompson
> > > > > > Sent: Sunday, October 14, 2018 12:36 AM
> > > > > > To: ffmpeg-devel@ffmpeg.org
> > > > > > Subject: Re: [FFmpeg-devel] [FFmpeg-cvslog] lavu/qsv: make a copy as libmfx alignment requirement for uploading
> > > > > >
> > > > > > On 11/10/18 06:38, Zhong Li wrote:
> > > > > > > ffmpeg | branch: master | Zhong Li <zhong...@intel.com> | Mon Sep 17 19:16:44 2018 +0800 | [681aa7d14f97fd98181ca6d61e11be48fe65692d]
> > > > > > >
> > > > > > > committer: Zhong Li
> > > > > > >
> > > > > > > lavu/qsv: make a copy as libmfx alignment requirement for uploading
> > > > > > >
> > > > > > > Libmfx requires 16 bytes aligned input/output for uploading.
> > > > > > > Currently only output is 16 byte aligned and assigning same
> > > > > > > width/height to input with smaller buffer size actually, thus
> > > > > > > definitely will cause segment fault.
> > > > > > >
> > > > > > > Can reproduce with any 1080p nv12 rawvideo input:
> > > > > > > ffmpeg -init_hw_device qsv=qsv:hw -hwaccel qsv -filter_hw_device qsv -f rawvideo -pix_fmt nv12 -s:v 1920x1080 -i 1080p_nv12.yuv -vf 'format=nv12,hwupload=extra_hw_frames=16,hwdownload,format=nv12' -an -y out_nv12.yuv
> > > > > > >
> > > > > > > It can fix #7418
> > > > > > >
> > > > > > > Signed-off-by: Zhong Li <zhong...@intel.com>
> > > > > > >
> > > > > > > http://git.videolan.org/gitweb.cgi/ffmpeg.git/?a=commit;h=681aa7d14f97fd98181ca6d61e11be48fe65692d
> > > > > > > ---
> > > > > > >  libavutil/hwcontext_qsv.c | 31 +++++++++++++++++++++++++++++--
> > > > > > >  1 file changed, 29 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/libavutil/hwcontext_qsv.c b/libavutil/hwcontext_qsv.c
> > > > > > > index 33e121b416..814ce215ce 100644
> > > > > > > --- a/libavutil/hwcontext_qsv.c
> > > > > > > +++ b/libavutil/hwcontext_qsv.c
> > > > > > > @@ -862,6 +862,10 @@ static int qsv_transfer_data_to(AVHWFramesContext *ctx, AVFrame *dst,
> > > > > > >      mfxSyncPoint sync = NULL;
> > > > > > >      mfxStatus err;
> > > > > > >      int ret = 0;
> > > > > > > +    /* make a copy if the input is not padded as libmfx requires */
> > > > > > > +    AVFrame tmp_frame, *src_frame;
> > > > > > > +    int realigned = 0;
> > > > > > > +
> > > > > > >
> > > > > > >      while (!s->session_upload_init && !s->session_upload && !ret) {
> > > > > > >  #if HAVE_PTHREADS
> > > > > > > @@ -887,16 +891,36 @@ static int qsv_transfer_data_to(AVHWFramesContext *ctx, AVFrame *dst,
> > > > > > >          if (ret < 0)
> > > > > > >              return ret;
> > > > > > >
> > > > > > > +
> > > > > > > +    if (src->height & 16 || src->linesize[0] & 16) {
> > > > > >
> > > > > > Doesn't this still need to check that the layout is compatible with the other limitations that libmfx has before trying to do the upload? In particular, that the pitches for the different planes are actually all the same - I certainly wouldn't expect them to be for YUV420P.
> > > > >
> > > > > Do you mean our previous discussion on https://lists.libav.org/pipermail/libav-devel/2018-April/086070.html ?
> > > > > If yes, I will try to reproduce it with a hwupload pipeline instead of a transcoding pipeline, and then give an updated patch.
> > > >
> > > > Yes. You might need to write a small test program to generate an arbitrary allocation pattern, since libavutil only has a fixed layout.
> > > >
> > > > > > (IIRC there was a requirement 0 < frame->data[N] - frame->data[0] < 2^24 as well (or something like that, I might be wrong), which also need not be true.)
> > > > >
> > > > > I don't know of such a requirement, could it be found in the MSDK guide?
> > > >
> > > > That came from discussion of hardware capabilities around VAAPI, I think.
> > >
> > > There are upper frame size limitations which HW can support. And this limitation can be different for different operations, for example the AVC encoder supports 4K as a maximum, Jpeg can support 16K. This depends on the component and may depend on the generation of hardware. In any case I don't think any of these kinds of limitations are something to be checked within ffmpeg during allocation. That's the responsibility of the underlying implementation via which you try to allocate to return an error to ffmpeg that the allocation can not be done. Or the responsibility of the component (msdk or vaapi) to return an error that processing of the given surface can not be done (due to per-component limitations).
> >
> > I meant real hardware capabilities around layout, which are going to be a superset of anything which can actually be reported in any particular API (see below).
> >
> > > > Won't something like this be the reason for wanting a virtually contiguous buffer? If there isn't a restriction along these lines then there is no need to make the buffer virtually contiguous. (After all, the hardware can't know about C undefined behaviour in pointer comparisons...)
> > >
> > > "Virtually contiguous" sounds dangerous to me. There is a discrepancy in treating layouts of color formats in ffmpeg and other libraries (and drivers). First of all I did not see a clear document describing what ffmpeg thinks the layout should be. What it seems to me is that ffmpeg treats color formats per-plane, meaning that in NV12 Y and UV could be allocated absolutely separately.
> >
> > Correct. FFmpeg deliberately places little constraint on the real layout of an NV12 frame - the planes need not be in any particular order or place in memory, nor need the pitch have any particular value relative to the width. The one constraint we do have is that each line start at an address which is a multiple of some defined platform-specific alignment (and therefore plane pitches are multiples of such a value as well).
> >
> > > While you work with the SW components this might be ok.
> > > But HW defines pretty specifically what it expects. And typically the memory layout is supposed to be contiguous, I mean _physically_ contiguous.
> >
> > While this is true in a lot of implementations, it isn't in the ones being considered here. Modern Intel GPUs, including their video hardware, have an MMU with the same level of page access as the CPU does, so anything which is virtually contiguous from the CPU point of view certainly can be from the GPU as well. Objects which are physically contiguous might be desirable in some marginal cases for performance (desirable ordering properties for RAM chips?), but certainly aren't necessary.
> >
> > > Secondly, HW may have some other implications over layout requiring particular alignments. These could be width alignments, i.e. pitches, or height padding alignments. As of now it seems that ffmpeg doesn't provide an architecture to satisfy all these needs. This leads to overheads in terms of memory consumption since additional "rightly allocated" memory is required to copy frames, and eventually this implies processing overheads as well.
> >
> > These real constraints are mostly unsatisfiable for reasonable use-cases of Intel GPUs. For example, H.264 encoding on Skylake requires Y-tiled NV12 surfaces with a number of additional constraints such as the chroma plane having the same pitch as the luma plane and being located at a line offset of at most 2^15-1. This particular constraint is the one I was thinking of above: given a pitch of (say) 2048 = 2^11, the chroma plane can't be offset from the luma plane by more than 2^26, and in the positive direction only - that pretty much means you need to allocate them virtually-contiguously, since planes allocated normally have no reason to satisfy that (especially with ASLR enabled). (This comes from <https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-vol02a-commandreference-instructions.pdf> - MFX_SURFACE_STATE beginning page 818 is relevant.)
> >
> > The Y-tiling in particular makes it very difficult for any upload case - there is no reasonable software implementation of the address swizzle needed to write Y-tiled planes. The one possible get-out here is to use the MMU to get a linear map of the Y-tiled surface, but in general this causes problems because such a mapping is uncached and therefore has very low performance for reads. (You can still use it in FFmpeg with VAAPI if you want - the av_hwframe_map() function and the hwmap filter allow it, though it's only really usable for write-only surfaces.)
> >
> > Given that, we have to make a copy somewhere in almost all cases. I believe the change made here for libmfx is done so that the GPU can perform the copy itself, lifting the input data directly from CPU memory and therefore saving on the CPU time which would be used to perform the same copy. I can't find this particular case in the relevant source code, but I assume it will have some similar constraints to the above video case - in particular, the chroma plane location relative to the luma plane is probably constrained in the same way (I admit I've extrapolated that from the virtual-contiguity requirement, but I don't see where else it would come from).
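
To make the layout question concrete, here is a minimal sketch of the kind of test qsv_transfer_data_to() would need before handing an NV12 frame straight to libmfx: the 16-alignment from the patch plus the equal-pitch and contiguity constraints discussed above. It is illustrative only - the constraint set is inferred from this thread rather than from the MSDK documentation, and the helper name is made up.

    #include <stddef.h>
    #include <libavutil/frame.h>
    #include <libavutil/pixfmt.h>

    /* Sketch: return 1 if an NV12 AVFrame can plausibly be handed to libmfx
     * for upload without a realigning copy, 0 if a copy is needed first. */
    static int nv12_layout_is_mfx_compatible(const AVFrame *src)
    {
        if (src->format != AV_PIX_FMT_NV12)
            return 0;

        /* The committed patch tests "& 16"; a strict alignment test is "& 15". */
        if ((src->height & 15) || (src->linesize[0] & 15))
            return 0;

        /* All planes must share the same pitch. */
        if (src->linesize[1] != src->linesize[0])
            return 0;

        /* UV must follow Y immediately (virtually contiguous).  This assumes
         * the Y plane is not over-padded in height; a real check would also
         * accept an aligned padded height.  Note that for separately
         * allocated planes this pointer subtraction is undefined behaviour
         * in C, as mentioned above - itself a hint that both planes ought to
         * come from one allocation. */
        if (src->data[1] - src->data[0] !=
            (ptrdiff_t)src->linesize[0] * src->height)
            return 0;

        return 1;
    }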
> >
> > > Here is an example from gstreamer where they describe what they mean under color formats: https://gstreamer.freedesktop.org/documentation/design/mediatype-video-raw.html. It would be wonderful if ffmpeg could define something similar. Pay attention that in gstreamer you clearly get an understanding of what the "default" NV12 memory layout is: it is contiguous memory where the UV plane is located right after the end of the Y plane, with no additional padding for width and height.
> >
> > It's unclear what you are gaining from this being defined as the "default". If that's defined in the API then it constrains the allocator to only return frames of that form (so you can't do things like optimising pitches to avoid cache way collisions), but offers little benefit to the user unless they happen to have input in exactly that form somehow (say because they read a raw file with identical layout). (And it's irrelevant for any component, since they have to deal with the general case anyway.)
>
> I did not try to make the point "let's do allocation in this way and call it the default". I was trying to say something else. Let me try to explain.
>
> ffmpeg components may not directly implement codecs, but use some 3rd party libraries underneath (which may or may not use hardware). I.e. a software/hardware stack below is possible, and this stack may imply certain limitations and requirements over memory layouts which may not be matched by ffmpeg. Thus, I believe it is desirable for ffmpeg to do 2 things:
>
> 1. Document and provide a way to programmatically check what the layout of the allocated memory is (plane sizes, relative offsets, etc.).
>
> 2. Provide a way to allocate memory in the way wanted by the underlying stack of one of the pipeline's components, to avoid additional copy operations. For example, if you build a pipeline of 2 components A->B and B requires a specific layout L for some color format, but A does not care, then it is better to allocate in B's fashion.
Ideally, it should be. But in fact it is hard (or impossible), because:

1. It requires all 3rd party libraries to have an API exposing their memory requirements. This is out of ffmpeg's control.

2. The FFmpeg pipeline is very flexible. The next step after A may be B, C or D, and B/C/D may have different memory requirements. From A's point of view, how can it know what the next step is and give a perfect memory allocation?

However, the NV12/NV21 case should be much simpler. Contiguous memory for NV12/NV21 is not just required by MSDK or Intel HW; it is a requirement of fourcc.org (https://www.fourcc.org/pixel-format/yuv-nv12/):

"A format in which all Y samples are found first in memory as an array of unsigned char with an even number of lines (possibly with a larger stride for memory alignment), _followed immediately_ by an array of unsigned char containing interleaved Cb and Cr samples (such that if addressed as a little-endian WORD type, Cb would be in the LSBs and Cr would be in the MSBs) with the same total stride as the Y samples. This is the preferred 4:2:0 pixel format."

The key words "_followed immediately_" mean that NV12 memory should be allocated contiguously in the first place, not only by gstreamer but by every application, including ffmpeg.

> Yes, incompatibilities are possible and copy operations generally speaking are inevitable. But the framework's responsibility is to try to build an effective pipeline taking care of each component's capabilities and preferences. Memory layout is one such capability; different layouts generally speaking mean different color formats. The number of surfaces to allocate is another example.
>
> > > I have tried to find something similar for ffmpeg, but failed (except the ffmpeg source code of course).
> >
> > The pixdesc descriptors (libavutil/pixdesc.[ch]) contain all the relevant information about pixel formats in a programmatic form, so some information about that could probably be auto-generated. (Relatedly, that gets used to test which formats are representable on an OpenCL device with a given set of capabilities (e.g. CL_R and CL_RG in CL_UNORM_INT8 -> NV12 support). Unfortunately very few APIs present this information in a machine-readable form, so generally the formats are hard-coded from fourccs or something like that.)
> >
> > Thanks,
> >
> > - Mark
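
Taken literally, the fourcc.org description quoted above pins down one specific layout: a single allocation with the Y plane first and the interleaved CbCr plane immediately after it, sharing one stride. A minimal sketch of the realigning copy this implies is below; the helper name, the 16-byte alignment choice and the bare av_malloc() ownership handling are assumptions for illustration, not the code from the commit under discussion.

    #include <stdint.h>
    #include <libavutil/frame.h>
    #include <libavutil/imgutils.h>
    #include <libavutil/macros.h>
    #include <libavutil/mem.h>

    /* Sketch: copy an arbitrary NV12 AVFrame into one contiguous buffer laid
     * out as fourcc.org describes - Y first, interleaved CbCr immediately
     * after, same stride for both planes.  dst is assumed to be a fresh frame
     * from av_frame_alloc(); the caller owns the buffer and must
     * av_free(dst->data[0]) when done (a real implementation would wrap it in
     * an AVBufferRef). */
    static int copy_to_contiguous_nv12(AVFrame *dst, const AVFrame *src)
    {
        int aligned_w = FFALIGN(src->width,  16);
        int aligned_h = FFALIGN(src->height, 16);
        uint8_t *buf  = av_malloc((size_t)aligned_w * aligned_h * 3 / 2);
        if (!buf)
            return AVERROR(ENOMEM);

        dst->format      = AV_PIX_FMT_NV12;
        dst->width       = src->width;
        dst->height      = src->height;
        dst->data[0]     = buf;                                  /* Y plane  */
        dst->data[1]     = buf + (size_t)aligned_w * aligned_h;  /* UV plane, "followed immediately" */
        dst->linesize[0] = aligned_w;
        dst->linesize[1] = aligned_w;

        av_image_copy(dst->data, dst->linesize,
                      (const uint8_t **)src->data, src->linesize,
                      AV_PIX_FMT_NV12, src->width, src->height);
        return 0;
    }

A frame prepared this way satisfies the kind of pitch-equality and contiguity check sketched earlier in the thread, so it could be handed to libmfx without a further copy.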