On Mon, 19 Aug 2024, Ramiro Polla wrote:

> If the stride is a negative number, the first sxtw does the right thing,
> but the "lsl w1, w1, #1" will zero out the upper half of the register.
>
> I'll start adding negative stride tests to checkasm to spot these bugs.

That's probably useful. The alternative is to switch these functions to ptrdiff_t for the stride, which should be register sized, so most of the sign extension issues around strides go away. (We've already transitioned lots of preexisting DSP interfaces, so doing that here would just be the next logical step. At times this may require minor touch-ups to existing assembly, but at the least it allows getting rid of such sign extensions later.)
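
To illustrate the issue, here is a rough C model of the two instruction sequences, plus a reference loop that takes the stride as ptrdiff_t. This is only a sketch: the function names are made up for illustration, the 16x16 block size is an assumption, and pix_sum_ref is not the actual FFmpeg C implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of "sxtw x1, w1" followed by "lsl w1, w1, #1".
 * A write to a W register clears bits 63:32 of the matching X register,
 * so the sign extension done by sxtw is lost again. */
static int64_t scale_stride_32bit_shift(int stride)
{
    int64_t  x1 = (int64_t)stride;     /* sxtw x1, w1 */
    uint32_t w1 = (uint32_t)x1 << 1;   /* lsl  w1, w1, #1 (32-bit result) */
    return (int64_t)(uint64_t)w1;      /* upper half reads back as zero */
}

/* Hypothetical model of "sxtw x1, w1" followed by "lsl x1, x1, #1".
 * The multiply matches the shift for strides that do not overflow. */
static int64_t scale_stride_64bit_shift(int stride)
{
    int64_t x1 = (int64_t)stride;      /* sxtw x1, w1 */
    return x1 * 2;                     /* lsl  x1, x1, #1 */
}

/* Illustrative 16x16 sum with a ptrdiff_t stride (assumed block size).
 * With a register-sized stride there is nothing left to sign-extend.
 * Rows are addressed from the base pointer so that no pointer outside
 * the block is ever formed, also for negative strides. */
static int pix_sum_ref(const uint8_t *pix, ptrdiff_t stride)
{
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        const uint8_t *row = pix + (ptrdiff_t)y * stride;
        for (int x = 0; x < 16; x++)
            sum += row[x];
    }
    return sum;
}
```

For a stride of -16, the first model ends up with 0x00000000FFFFFFE0 instead of -32, which is the kind of bug a negative stride test would catch. For a negative stride, pix_sum_ref expects to be passed a pointer to the last row of the block.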

> > With this, I'm down from your 120 cycles on the A53 originally, to 78.7
> > cycles now. Your fully unrolled version seemed to run in 72 cycles on the
> > A53, so that's obviously even faster, but I think this kind of tradeoff
> > might be the sweet spot. What does such a version give you in terms of
> > real world speed?
>
> This version is around 0.5% slower overall on the A76. Very roughly
> these are the total times taken by pix_sum and pix_norm1 with the
> different implementations on A76:
>   c:              ~5%
>   fully unrolled: ~3%
>   unroll 2:       2.5%
>   tight loop:     2%

Ok. Given the tradeoff across different cores (including ones not tested here), do you think this version would be a reasonable compromise (giving almost ideal results on in-order cores, and not too much slowdown on out-of-order cores in this benchmark)?

// Martin
