On Thu, 31 Mar 2022, Ben Avison wrote:
On 30/03/2022 13:35, Martin Storsjö wrote:
Overall, the code looks sensible to me.
Would it make sense to share the core of the filter between the
horizontal/vertical cases with e.g. a macro? (I didn't check in detail if
there's much difference in the core of the filter. At most some
differences in condition registers for partial writeout in the horizontal
forms?)
Well, looking at the comments at the right-hand side of the source, which
give the logical meaning of the results of each instruction, I admit there's
a resemblance in the middle of the 8-pixel-pair function.
Actually, I didn't try to follow/compare it to that level, I just assumed
them to be similar.
However, the physical register assignments are quite different, and
attempting to reassign the registers in one to match the other isn't a
trivial task. It's hard enough when you start register assignment from
the top of a function and work your way down, as I have done here.
In the 16-pixel-pair case, the input values arrive in a different order in the
two versions: in one they are loaded in regularly-increasing address order,
and in the other they fall out of a matrix transposition. As a result, even
the logical order of the instructions is quite different in the two cases.
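Purely as an illustration - the dimensions and helper names below are made up,
not taken from the patch - the difference in how the two cases acquire their
rows looks roughly like this in C:

#include <stddef.h>
#include <stdint.h>

/* "v" case: the rows the filter needs sit at regularly increasing
 * addresses, so they land in registers in their natural order. */
static void gather_rows_v(const uint8_t *src, ptrdiff_t stride,
                          uint8_t rows[8][16])
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 16; c++)
            rows[r][c] = src[r * stride + c];
}

/* "h" case: the same logical rows are columns in memory, so they come out
 * of a block load followed by a transpose, and the order in which complete
 * rows become available is dictated by the transpose, not by the loads. */
static void gather_rows_h(const uint8_t *src, ptrdiff_t stride,
                          uint8_t rows[8][16])
{
    for (int r = 0; r < 16; r++)
        for (int c = 0; c < 8; c++)
            rows[c][r] = src[r * stride + c];
}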
In the 4-pixel-pair case, the values are packed into registers differently in
the two cases. In the v case we're loading 4 pixels from each row, stepping by
the row stride, which makes it easy to place each row in its own vector,
whereas in the h case we load 4 rows of 8 pixels each and transpose, which
leaves the values in 4 vectors rather than 8. Some of the filtering steps
(calculating a1 and a2) can be performed with the data packed in this way,
while waiting for it to be restructured in order to calculate the other
metrics, but it's not worth packing the data together like this in the v case,
given that it starts off already separated. So the two implementations end up
differing in the operations they perform, not just in instruction scheduling
and register assignment.
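For anyone following along without the C reference to hand: going by
vc1_filter_line() in libavcodec/vc1dsp.c, a1 and a2 apply the same arithmetic
to the four pixels on each side of the edge, so if Pn and Pn+4 end up sharing
a vector after the transpose, one vector computation yields both, whereas a0
mixes pixels from both sides and needs the data restructured first. Roughly
(a sketch only - take the exact constants from vc1dsp.c, not from here):

#include <stdlib.h>

/* Rough sketch of the per-line metrics, following the shape of the C
 * reference; p[0]..p[7] are the eight pixels straddling the edge, with the
 * edge between p[3] and p[4]. */
static void filter_metrics(const int p[8], int *a0, int *a1, int *a2)
{
    /* a0 uses the middle four pixels, i.e. both sides of the edge */
    *a0 = abs((2 * (p[2] - p[5]) - 5 * (p[3] - p[4]) + 4) >> 3);
    /* a1 uses only the left-hand four ... */
    *a1 = abs((2 * (p[0] - p[3]) - 5 * (p[1] - p[2]) + 4) >> 3);
    /* ... and a2 only the right-hand four, with identical structure, so
     * both can fall out of one vector computation on the packed layout */
    *a2 = abs((2 * (p[4] - p[7]) - 5 * (p[5] - p[6]) + 4) >> 3);
}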
Some background: as you may have guessed, I didn't start out writing these
functions as they currently appear. Prototype versions didn't care much for
scheduling or keeping to a small number of registers. They were primarily for
checking the correctness of the mathematics, and they'd use all available
vectors, sometimes shuffling values between registers or to the stack to make
room. Once I'd verified correctness, I then reworked them to keep to a
minimal number of registers and to minimise stalls as far as possible.
I'm targeting the Cortex-A72, since that's what the Raspberry Pi 4 uses and
it's on the cusp of having enough power to decode VC-1 Blu-ray streams, so I
deliberately didn't give much consideration to the requirements of earlier
cores. Yes, it's an out-of-order core, but I reckoned there are probably
limits to how wisely it can select instructions to execute (there have got to
be limits to instruction queue lengths, for example). So, based on the
pipeline structure documented in Arm's Cortex-A72 software optimization guide,
I arranged the instructions to keep all pipelines as busy as possible, then
assigned registers to keep the instructions in that order.
For the most part, I was able to keep the number of vectors used low enough
that no callee-saving was required - or, failing that, at least to avoid
having to spill values to the stack mid-function. But it came pretty close at
times - witness, for example, the peculiar order in which vectors had to be
loaded in the AArch32 version of ff_vc1_h_loop_filter16_neon. There's a reason
behind that!
In short, I'd really rather not tamper with these larger assembly functions
any more unless I really have to.
Ok, fair enough.
FWIW, my point of view was from implementing the loop filters for VP9 and
AV1, where I did the core filter as one shared implementation for both
variants, and where the frontend functions just load (and transpose) data
into the registers used as input for the common core filter, and vice
versa.
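In C terms (hypothetical names, and obviously the real thing is assembly with
the core as a macro rather than a function call), the arrangement is roughly:

#include <stddef.h>
#include <stdint.h>

/* Sketch of the shared-core arrangement, not the actual VP9/AV1 code: the
 * core filter operates on one fixed layout - eight lines of pixels
 * straddling the edge - and the h/v frontends only differ in how they move
 * data between memory and that layout. */
static void loop_filter_core(uint8_t p[8][16], int strength)
{
    /* all of the filter arithmetic would live here, written once */
    (void)p;
    (void)strength;
}

static void loop_filter_v(uint8_t *src, ptrdiff_t stride, int strength)
{
    uint8_t p[8][16];
    for (int r = 0; r < 8; r++)              /* straight strided loads */
        for (int c = 0; c < 16; c++)
            p[r][c] = src[(r - 4) * stride + c];
    loop_filter_core(p, strength);
    for (int r = 0; r < 8; r++)              /* straight strided stores */
        for (int c = 0; c < 16; c++)
            src[(r - 4) * stride + c] = p[r][c];
}

static void loop_filter_h(uint8_t *src, ptrdiff_t stride, int strength)
{
    uint8_t p[8][16];
    for (int r = 0; r < 16; r++)             /* load a block and transpose */
        for (int c = 0; c < 8; c++)
            p[c][r] = src[r * stride + c - 4];
    loop_filter_core(p, strength);
    for (int r = 0; r < 16; r++)             /* transpose back and store */
        for (int c = 0; c < 8; c++)
            src[r * stride + c - 4] = p[c][r];
}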
But I presume that a custom implementation for each of them can be more
optimal, at the cost of more code to maintain (though if there are no bugs,
it usually doesn't need maintenance either).
Thus - fair enough, this code probably is ok then.
// Martin