mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1

Martin Storsjö Mon, 19 Aug 2024 05:09:14 -0700

On Mon, 19 Aug 2024, Ramiro Polla wrote:

>> If the stride is a negative number, the first sxtw does the right thing,
>> but the "lsl w1, w1, #1" will zero out the upper half of the register.
>
> I'll start adding negative stride tests to checkasm to spot these bugs.
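
For readers following along, the sign extension problem quoted above can be modelled in plain C. This is only a rough sketch: the helper names (model_sxtw, model_lsl_w, model_lsl_x, double_stride_w, double_stride_x) are made up for illustration and are not part of FFmpeg, and the signed conversions assume the usual two's complement behaviour of mainstream compilers.

```c
#include <stdint.h>

/* Model of "sxtw x1, w1": sign-extend the low 32 bits to 64 bits. */
static uint64_t model_sxtw(uint64_t x)
{
    return (uint64_t)(int64_t)(int32_t)(uint32_t)x;
}

/* Model of "lsl w1, w1, #n": a 32-bit shift. Writing a W register
 * clears bits 63:32 of the corresponding X register. */
static uint64_t model_lsl_w(uint64_t x, unsigned n)
{
    return (uint64_t)((uint32_t)x << n);
}

/* Model of "lsl x1, x1, #n": a full 64-bit shift keeps the sign bits. */
static uint64_t model_lsl_x(uint64_t x, unsigned n)
{
    return x << n;
}

/* Doubling the stride with a 32-bit shift after the sxtw (problematic). */
static int64_t double_stride_w(int32_t stride)
{
    uint64_t x1 = model_sxtw((uint32_t)stride);
    return (int64_t)model_lsl_w(x1, 1);
}

/* Doubling the stride with a 64-bit shift after the sxtw. */
static int64_t double_stride_x(int32_t stride)
{
    uint64_t x1 = model_sxtw((uint32_t)stride);
    return (int64_t)model_lsl_x(x1, 1);
}
```

With a stride of -16, double_stride_x() gives -32, while double_stride_w() gives a large positive offset because the upper half of the register was cleared. For positive strides the two agree, which is why tests with only positive strides would not catch this.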

That's probably useful. The other alternative is to transition these cases
to use ptrdiff_t for the stride, which should be register sized, so most
of the sign extension issues around strides go away. (We've transitioned
lots of preexisting DSP interfaces already, so doing that here would just
be the next logical step. But at times, this may require marginal
touch-ups to existing assembly, or at least allows getting rid of such
sign extensions later.)
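
As a rough illustration of what a ptrdiff_t based interface could look like, here is a minimal C sketch. The names pix_sum_ref, pix_norm1_ref and negative_stride_matches are hypothetical, the 16x16 block size is an assumption, and these are not the actual FFmpeg prototypes. The last helper shows the kind of negative stride comparison a test could make.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: sum of a 16x16 block (block size assumed). */
static int pix_sum_ref(const uint8_t *pix, ptrdiff_t stride)
{
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        /* Compute each row from the base so no pointer leaves the buffer. */
        const uint8_t *row = pix + y * stride;
        for (int x = 0; x < 16; x++)
            sum += row[x];
    }
    return sum;
}

/* Hypothetical reference: sum of squares of a 16x16 block. */
static int pix_norm1_ref(const uint8_t *pix, ptrdiff_t stride)
{
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        const uint8_t *row = pix + y * stride;
        for (int x = 0; x < 16; x++)
            sum += row[x] * row[x];
    }
    return sum;
}

/* Walk the same block top-down (positive stride) and bottom-up
 * (negative stride); both results are order independent, so they
 * must match. Returns 1 on success. */
static int negative_stride_matches(void)
{
    uint8_t buf[16 * 16];
    for (int i = 0; i < 16 * 16; i++)
        buf[i] = (uint8_t)(i * 7 + 3);

    const uint8_t *last_row = buf + 15 * 16;
    return pix_sum_ref(buf, 16)   == pix_sum_ref(last_row, -16) &&
           pix_norm1_ref(buf, 16) == pix_norm1_ref(last_row, -16);
}
```

Since the stride already arrives register sized here, an assembly implementation of such a prototype would not need the sxtw at all.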

>> With this, I'm down from your 120 cycles on the A53 originally, to 78.7
>> cycles now. Your fully unrolled version seemed to run in 72 cycles on the
>> A53, so that's obviously even faster, but I think this kind of tradeoff
>> might be the sweet spot. What does such a version give you in terms of
>> real world speed?
>
> This version is around 0.5% slower overall on the A76. Very roughly
> these are the total times taken by pix_sum and pix_norm1 with the
> different implementations on A76:
> c: ~5%
> fully unrolled: ~3%
> unroll 2: 2.5%
> tight loop: 2%

Ok. Given the tradeoff between various different cores (including ones not
tested here), do you think this version would be a reasonable compromise
(giving almost ideal results on in-order cores, and not too much slowdown
on out-of-order cores in this benchmark)?


// Martin


Re: [FFmpeg-devel] [PATCH 2/7] avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1
