On Sun, 18 Aug 2024, Ramiro Polla wrote:

> I had tested the real world case on the A76, but not on the A53. I
> spent a couple of hours with perf trying to find the source of the
> discrepancy but I couldn't find anything conclusive. I need to learn
> more about how to test cache misses.

Nah, I guess that's a bit overkill...

> I just tested again with the following command:
> $ taskset -c 2 ./ffmpeg_g -benchmark -f lavfi -i
> "testsrc2=size=1920x1080" -vcodec mpeg4 -q 31 -vframes 100 -f rawvideo
> -y /dev/null
>
> The entire test was about 1% faster unrolled on A53, but about 1%
> slower unrolled on A76 (I had the Raspberry Pi 5 in mind for these
> optimizations, so I preferred choosing the version that was faster on
> the A76).
>
> I wonder if there is any way we could check at runtime.

There are indeed often cases where functions could be tuned differently for older/newer or in-order/out-of-order cores. In most cases, though, trying to specialize things is a bit of a waste and overkill - I'd just suggest going with a compromise.

(Sometimes, different kinds of tunings can be applied if you use e.g. the flag dotprod to differentiate between older and newer cores. But it's seldom worth the extra effort to do that.)


Right, so looking at your unrolled case, you've done a full unroll. That's probably also a bit overkill.

The in-order cores really hate tight loops where almost everything has a sequential dependency on the previous instruction - so the general rule of thumb is that you'll want to unroll by a factor of 2, unless the algorithm itself has enough complexity that there's two separate dependency chains interlinked.
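To illustrate the unroll-by-2 idea in scalar C (a hypothetical sketch, not the actual asm): with two independent accumulators, consecutive additions no longer depend on each other, so an in-order core can overlap them instead of stalling on one serial chain.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar sketch of unrolling by 2 with two accumulators:
 * s0 and s1 form two independent dependency chains, which an in-order
 * core like the A53 can execute back to back without stalling. */
static int sum_bytes_unroll2(const uint8_t *p, size_t n)
{
    int s0 = 0, s1 = 0;
    /* n is assumed to be even here */
    for (size_t i = 0; i < n; i += 2) {
        s0 += p[i];     /* chain 1 */
        s1 += p[i + 1]; /* chain 2 */
    }
    return s0 + s1;
}
```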

Also, from your unrolled version, there's a slight bug in it:

+        add             x2, x0, w1, sxtw
+        lsl             w1, w1, #1

If the stride is a negative number, the first sxtw does the right thing, but the "lsl w1, w1, #1" will zero out the upper half of the register.

So for that, you'd still need to keep the "sxtw x1, w1" instruction, and do the lsl on x1 instead. It is actually possible to merge it into one instruction though, with "sbfiz x1, x1, #1, #32", if I read the docs right. But that's a much more uncommon instruction...
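As a quick C model of the register semantics involved (not the actual asm - just modelling that a 32-bit AArch64 instruction zero-extends its result into the full 64-bit register):

```c
#include <stdint.h>

/* Buggy variant, "lsl w1, w1, #1": the shift happens in 32 bits and the
 * result is zero-extended into x1, so a negative stride becomes a huge
 * positive offset. */
static uint64_t stride2_lsl_w(int32_t stride)
{
    return (uint64_t)((uint32_t)stride << 1);
}

/* Correct variant, "sxtw x1, w1" + "lsl x1, x1, #1" (or the single
 * "sbfiz x1, x1, #1, #32"): sign-extend the 32-bit stride first, then
 * double it in 64 bits. (Written as *2 to avoid shifting a negative
 * value in C.) */
static int64_t stride2_sbfiz(int32_t stride)
{
    return (int64_t)stride * 2;
}
```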

As for optimal performance here - I tried something like this:

        movi            v0.16b, #0
        add             x2, x0, w1, sxtw
        sbfiz           x1, x1, #1, #32
        mov             w3, #16

1:
        ld1             {v1.16b}, [x0], x1
        ld1             {v2.16b}, [x2], x1
        subs            w3, w3, #2
        uadalp          v0.8h, v1.16b
        uadalp          v0.8h, v2.16b
        b.ne            1b

        uaddlv          s0, v0.8h
        fmov            w0, s0

        ret
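(For reference - this is my reading of the loop, with the function assumed to be a 16x16 pix_sum - in plain C it should be equivalent to:)

```c
#include <stddef.h>
#include <stdint.h>

/* Plain-C model of the loop above: sum all bytes of a 16x16 block.
 * The asm walks even rows via x0 and odd rows via x2, each stepping by
 * 2*stride, which visits the same 16 rows as this loop does. */
static int pix_sum16_c(const uint8_t *pix, ptrdiff_t stride)
{
    int sum = 0;
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++)
            sum += pix[i * stride + j];
    return sum;
}
```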

With this, I'm down from your 120 cycles on the A53 originally, to 78.7 cycles now. Your fully unrolled version seemed to run in 72 cycles on the A53, so that's obviously even faster, but I think this kind of tradeoff might be the sweet spot. What does such a version give you in terms of real world speed?

On this version, you can also note that the two sequential uadalp instructions may look problematic. I did try using two separate accumulator registers, accumulating into v0 and v1 separately and only summing them at the end, but that didn't make any difference. So the A53 may have a special case where two such sequential accumulations into the same register don't incur the full extra latency. (The A53 does have such a case for "mla", at least.)

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
