On Thu, 31 Mar 2022, Ben Avison wrote:
> On 30/03/2022 14:49, Martin Storsjö wrote:
>> Looks generally reasonable. Is it possible to factorize out the individual
>> transforms (so that you'd e.g. invoke the same macro twice in the 8x8 and
>> 4x4 functions) without too much loss?
>
> There is a close analogy here with the vertical/horizontal deblocking
> filters, because while there are similarities between the two matrix
> multiplications within a transform, one of them follows a series of loads
> and the other follows a matrix transposition.
>
> If you look for example at ff_vc1_inv_trans_8x8_neon, you'll see I was able
> to do a fair amount of overlap between sections of the function -
> particularly between the transpose and the second matrix multiplication,
> but to a lesser extent between the loads and the first matrix
> multiplication, and between the second multiplication and the stores. This
> sort of overlapping is tricky to maintain when using macros. It also means
> that the order of operations within each matrix multiply ended up quite
> different.
>
> At first sight, you might think that the multiplies from the 8x8 function
> (which you might also view as a kind of 8-tap filter) would be re-usable
> for the size-8 multiplies in the 8x4 or 4x8 functions. Yes, the
> instructions are similar, save for using .4h elements rather than .8h
> elements, but that has significant impacts on scheduling. For example, the
> Cortex-A72, which is my primary target, can only issue NEON bit-shifts to
> one pipeline, irrespective of whether the vectors are 64-bit or 128-bit,
> while other instructions don't have that restriction.
>
> So while in theory you could factor some of this code out more, I suspect
> any attempt to do so would have a detrimental effect on performance.
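To make the shape of the problem concrete: a 2-D inverse transform is two 1-D passes with a transpose in between, so the first multiply sits right after the loads and the second right after the transpose. The scalar C sketch below is purely illustrative - a trivial butterfly stands in for the real VC-1 constant multiplies, and none of the NEON interleaving Ben describes is visible at this level:

```c
#include <stdint.h>

/* Illustrative 1-D pass on 8 values. The real VC-1 transform uses
 * specific constant multiplies; this butterfly is just a stand-in. */
static void pass_8(int16_t *v)
{
    for (int i = 0; i < 4; i++) {
        int16_t a = v[i], b = v[7 - i];
        v[i]     = a + b;
        v[7 - i] = a - b;
    }
}

static void transpose_8x8(int16_t m[8][8])
{
    for (int i = 0; i < 8; i++)
        for (int j = i + 1; j < 8; j++) {
            int16_t t = m[i][j];
            m[i][j] = m[j][i];
            m[j][i] = t;
        }
}

/* Two-pass 2-D structure: the row pass runs right after the loads, the
 * column pass right after the transpose - which is why, in the NEON
 * version, two "identical" multiplies end up scheduled against very
 * different surrounding code. */
void inv_trans_8x8_sketch(int16_t block[8][8])
{
    for (int i = 0; i < 8; i++)
        pass_8(block[i]);        /* row pass, follows the loads */
    transpose_8x8(block);
    for (int i = 0; i < 8; i++)
        pass_8(block[i]);        /* column pass, follows the transpose */
}
```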
Ok, fair enough. Yes, it's always a trade-off between code simplicity and
getting the optimal interleaving. As you've already put in the effort to
make it efficient in that respect, let's go with that then!
(FWIW, for future endeavours, having the checkasm tests in place while
developing/tuning the implementation does allow getting good empirical
data on how much you gain from different alternative scheduling choices. I
usually don't follow the optimization guides for any specific core, but
track the benchmark numbers for a couple different cores and try to pick a
scheduling that is a decent compromise for all of them.)
Also, for future work - if you have checkasm tests in place while working
on the assembly, I usually amend the test with debug printouts that
visualize the output of the reference and the tested function, plus a map
showing which elements differ - which makes tracking down issues a whole
lot easier. I don't think any of the checkasm tests in ffmpeg have such
printouts yet, but within e.g. the dav1d project, the checkasm tool has
been extended with helpers for comparing and printing such debug aids.
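The general idea is simple enough to sketch; this is a hypothetical helper along those lines (not dav1d's actual API), printing reference and tested buffers side by side plus a map marking mismatching elements:

```c
#include <stdio.h>
#include <stdint.h>

/* Print ref and test side by side, followed by a per-element map
 * ('.' = match, 'x' = mismatch). Returns the number of mismatches. */
static int dump_cmp_i16(const int16_t *ref, const int16_t *test,
                        int w, int h, const char *name)
{
    int nerr = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            printf("%6d", ref[y * w + x]);
        printf("    ");
        for (int x = 0; x < w; x++)
            printf("%6d", test[y * w + x]);
        printf("    ");
        for (int x = 0; x < w; x++) {
            int diff = ref[y * w + x] != test[y * w + x];
            nerr += diff;
            putchar(diff ? 'x' : '.');
        }
        putchar('\n');
    }
    if (nerr)
        printf("%s: %d mismatching elements\n", name, nerr);
    return nerr;
}
```

Hooking something like this into a failing checkasm test immediately shows whether a mismatch is, say, one column (a transpose bug) or one row (a load/store bug).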
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".