On Sun, 18 Aug 2024, Ramiro Polla wrote:

> I had tested the real world case on the A76, but not on the A53. I
> spent a couple of hours with perf trying to find the source of the
> discrepancy but I couldn't find anything conclusive. I need to learn
> more about how to test cache misses.

Nah, I guess that's a bit overkill...

> I just tested again with the following command:
> $ taskset -c 2 ./ffmpeg_g -benchmark -f lavfi -i
> "testsrc2=size=1920x1080" -vcodec mpeg4 -q 31 -vframes 100 -f rawvideo
> -y /dev/null
>
> The entire test was about 1% faster unrolled on A53, but about 1%
> slower unrolled on A76 (I had the Raspberry Pi 5 in mind for these
> optimizations, so I preferred choosing the version that was faster on
> the A76).
>
> I wonder if there is any way we could check at runtime.

There are indeed often cases where functions could be tuned differently for older/newer or in-order/out-of-order cores. In most cases, though, trying to specialize things is a bit of a waste and overkill - I'd just suggest going with a compromise.

(Sometimes, different kinds of tunings can be applied if you use e.g. the flag dotprod to differentiate between older and newer cores. But it's seldom worth the extra effort to do that.)


Right, so looking at your unrolled case, you've done a full unroll. That's probably also a bit overkill.

The in-order cores really hate tight loops where almost everything has a sequential dependency on the previous instruction - so the general rule of thumb is that you'll want to unroll by a factor of 2, unless the algorithm itself has enough complexity that there's two separate dependency chains interlinked.
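To illustrate the unroll-by-2 idea in scalar C (a hypothetical sketch, not the actual asm): with two independent accumulators, consecutive additions no longer depend on each other, so an in-order core can overlap them instead of stalling on one serial chain.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar sketch of unrolling by 2 with two accumulators:
 * s0 and s1 form two independent dependency chains, which an in-order
 * core like the A53 can execute back to back without stalling. */
static int sum_bytes_unroll2(const uint8_t *p, size_t n)
{
    int s0 = 0, s1 = 0;
    /* n is assumed to be even here */
    for (size_t i = 0; i < n; i += 2) {
        s0 += p[i];     /* chain 1 */
        s1 += p[i + 1]; /* chain 2 */
    }
    return s0 + s1;
}
```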

Also, from your unrolled version, there's a slight bug in it:

+        add             x2, x0, w1, sxtw
+        lsl             w1, w1, #1

If the stride is a negative number, the first sxtw does the right thing, but the "lsl w1, w1, #1" will zero out the upper half of the register.

So for that, you'd still need to keep the "sxtw x1, w1" instruction, and do the lsl on x1 instead. It is actually possible to merge it into one instruction though, with "sbfiz x1, x1, #1, #32", if I read the docs right. But that's a much more uncommon instruction...
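As a quick C model of the register semantics involved (not the actual asm - just modelling that a 32-bit AArch64 instruction zero-extends its result into the full 64-bit register):

```c
#include <stdint.h>

/* Buggy variant, "lsl w1, w1, #1": the shift happens in 32 bits and the
 * result is zero-extended into x1, so a negative stride becomes a huge
 * positive offset. */
static uint64_t stride2_lsl_w(int32_t stride)
{
    return (uint64_t)((uint32_t)stride << 1);
}

/* Correct variant, "sxtw x1, w1" + "lsl x1, x1, #1" (or the single
 * "sbfiz x1, x1, #1, #32"): sign-extend the 32-bit stride first, then
 * double it in 64 bits. (Written as *2 to avoid shifting a negative
 * value in C.) */
static int64_t stride2_sbfiz(int32_t stride)
{
    return (int64_t)stride * 2;
}
```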

As for optimal performance here - I tried something like this:

        movi            v0.16b, #0
        add             x2, x0, w1, sxtw
        sbfiz           x1, x1, #1, #32
        mov             w3, #16

1:
        ld1             {v1.16b}, [x0], x1
        ld1             {v2.16b}, [x2], x1
        subs            w3, w3, #2
        uadalp          v0.8h, v1.16b
        uadalp          v0.8h, v2.16b
        b.ne            1b

        uaddlv          s0, v0.8h
        fmov            w0, s0

        ret
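(For reference - this is my reading of the loop, with the function assumed to be a 16x16 pix_sum - in plain C it should be equivalent to:)

```c
#include <stddef.h>
#include <stdint.h>

/* Plain-C model of the loop above: sum all bytes of a 16x16 block.
 * The asm walks even rows via x0 and odd rows via x2, each stepping by
 * 2*stride, which visits the same 16 rows as this loop does. */
static int pix_sum16_c(const uint8_t *pix, ptrdiff_t stride)
{
    int sum = 0;
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++)
            sum += pix[i * stride + j];
    return sum;
}
```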

With this, I'm down from your 120 cycles on the A53 originally, to 78.7 cycles now. Your fully unrolled version seemed to run in 72 cycles on the A53, so that's obviously even faster, but I think this kind of tradeoff might be the sweet spot. What does such a version give you in terms of real world speed?

On this version, you can also note that the two sequential uadalp instructions may look problematic. I did try using two separate accumulator registers, accumulating into v0 and v1 separately and only summing them at the end, but that didn't make any difference. So the A53 may have a special case where two such sequential accumulations into the same register don't incur the full extra latency. (The A53 does have such a case for "mla", at least.)

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
