On Mon, 19 Aug 2024, Ramiro Polla wrote:
If the stride is a negative number, the first sxtw does the right thing,
but the "lsl w1, w1, #1" will zero out the upper half of the register.
I'll start adding negative stride tests to checkasm to spot these bugs.
That's probably useful. The other alternative is to transition these cases
to use ptrdiff_t for the stride, which should be register sized, so most
of the sign extension issues around strides go away. (We've transitioned
lots of preexisting DSP interfaces already, so doing that here would just
be the next logical step. But at times, this may require marginal
touch-ups to existing assembly, or at least allows getting rid of such
sign extensions later.)
With this, I'm down from your 120 cycles on the A53 originally, to 78.7
cycles now. Your fully unrolled version seemed to run in 72 cycles on the
A53, so that's obviously even faster, but I think this kind of tradeoff
might be the sweet spot. What does such a version give you in terms of
real world speed?
This version is around 0.5% slower overall on the A76. Very roughly
these are the total times taken by pix_sum and pix_norm1 with the
different implementations on A76:
c: ~5%
fully unrolled: ~3%
unroll 2: 2.5%
tight loop: 2%
Ok. Given the tradeoff between various different cores (including ones not
tested here), do you think this version would be a reasonable compromise
(giving almost ideal results on in-order cores, and not too much slowdown
on out-of-order cores in this benchmark)?
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".