On Mon, Aug 19, 2024 at 11:46 AM Martin Storsjö <mar...@martin.st> wrote:
> On Sun, 18 Aug 2024, Ramiro Polla wrote:
> > I had tested the real world case on the A76, but not on the A53. I
> > spent a couple of hours with perf trying to find the source of the
> > discrepancy but I couldn't find anything conclusive. I need to learn
> > more about how to test cache misses.
>
> Nah, I guess that's a bit overkill...
>
> > I just tested again with the following command:
> > $ taskset -c 2 ./ffmpeg_g -benchmark -f lavfi -i
> > "testsrc2=size=1920x1080" -vcodec mpeg4 -q 31 -vframes 100 -f rawvideo
> > -y /dev/null
> >
> > The entire test was about 1% faster unrolled on A53, but about 1%
> > slower unrolled on A76 (I had the Raspberry Pi 5 in mind for these
> > optimizations, so I preferred choosing the version that was faster on
> > the A76).
>
> > I wonder if there is any way we could check at runtime.
>
> There are indeed often cases where functions could be tuned differently
> for older/newer or in-order/out-of-order cores. In most cases, trying to
> specialize things is a bit waste and overkill though - in most cases, I'd
> just suggest going with a compromise.
>
> (Sometimes, different kinds of tunings can be applied if you use e.g. the
> flag dotprod to differentiate between older and newer cores. But it's
> seldom worth the extra effort to do that.)
>
>
> Right, so looking at your unrolled case, you've done a full unroll. That's
> probably also a bit overkill.
>
> The in-order cores really hate tight loops where almost everything has a
> sequential dependency on the previous instruction - so the general rule of
> thumb is that you'll want to unroll by a factor of 2, unless the algorithm
> itself has enough complexity that there's two separate dependency chains
> interlinked.
>
> Also, from your unrolled version, there's a slight bug in it:
>
> > +        add             x2, x0, w1, sxtw
> > +        lsl             w1, w1, #1
>
> If the stride is a negative number, the first sxtw does the right thing,
> but the "lsl w1, w1, #1" will zero out the upper half of the register.
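
For anyone following along, the difference can be modelled in plain C. This is only a sketch of the register semantics as I understand them (a write to a W register clears bits 63:32 of the X register, while sbfiz sign-extends the low 32 bits before shifting). The helper names are made up for illustration and are not part of FFmpeg.

```c
#include <stdint.h>

/* Hypothetical model of "lsl w1, w1, #1": the shift happens in 32 bits,
 * and writing a W register zeroes the upper 32 bits of the X register. */
static uint64_t x1_after_lsl_w(int32_t stride)
{
    uint32_t w1 = (uint32_t)stride << 1;   /* 32-bit shift, wraps modulo 2^32 */
    return (uint64_t)w1;                   /* zero-extended into x1 */
}

/* Hypothetical model of "sbfiz x1, x1, #1, #32": take the low 32 bits,
 * sign-extend them to 64 bits, then shift left by one. */
static uint64_t x1_after_sbfiz(int32_t stride)
{
    return (uint64_t)(int64_t)stride << 1; /* shift done on the unsigned value */
}
```

With a stride of -16, the first helper yields 0x00000000FFFFFFE0 (a large positive offset), while the second yields the 64-bit pattern for -32, which is what a post-indexed load needs. For positive strides both agree.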

I'll start adding negative stride tests to checkasm to spot these bugs.

> So for that, you'd still need to keep the "sxtw x1, w1" instruction, and
> do the lsl on x1 instead. It is actually possible to merge it into one
> instruction though, with "sbfiz x1, x1, #1, #32", if I read the docs
> right. But that's a much more uncommon instruction...
>
> As for optimal performance here - I tried something like this:
>
>         movi            v0.16b, #0
>         add             x2, x0, w1, sxtw
>         sbfiz           x1, x1, #1, #32
>         mov             w3, #16
>
> 1:
>         ld1             {v1.16b}, [x0], x1
>         ld1             {v2.16b}, [x2], x1
>         subs            w3, w3, #2
>         uadalp          v0.8h, v1.16b
>         uadalp          v0.8h, v2.16b
>         b.ne            1b
>
>         uaddlv          s0, v0.8h
>         fmov            w0, s0
>
>         ret
>
> With this, I'm down from your 120 cycles on the A53 originally, to 78.7
> cycles now. Your fully unrolled version seemed to run in 72 cycles on the
> A53, so that's obviously even faster, but I think this kind of tradeoff
> might be the sweet spot. What does such a version give you in terms of
> real world speed?

This version is around 0.5% slower overall on the A76.

Very roughly these are the total times taken by pix_sum and pix_norm1
with the different implementations on A76:
c:              ~5%
fully unrolled: ~3%
unroll 2:       2.5%
tight loop:     2%
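
As a rough scalar model of the unroll-by-2 structure discussed in this thread, the sketch below sums a 16x16 block two rows per iteration and can be compared against a plain row-by-row loop, including with a negative stride. It assumes a 16x16 block and a 32-bit int stride (which is what the sxtw in the assembly suggests). The function names are hypothetical and this is not the actual FFmpeg C reference.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain reference: sum a 16x16 block of 8-bit pixels, one row at a time. */
static int pix_sum_ref(const uint8_t *pix, int stride)
{
    ptrdiff_t off = 0;                 /* signed offset, so negative strides work */
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += pix[off + x];
        off += stride;
    }
    return sum;
}

/* Two rows per iteration: off0 plays the role of x0, off1 of x2, and
 * step of the doubled, sign-extended stride held in x1. */
static int pix_sum_unroll2(const uint8_t *pix, int stride)
{
    ptrdiff_t off0 = 0;
    ptrdiff_t off1 = stride;
    ptrdiff_t step = (ptrdiff_t)stride * 2;  /* widen before doubling */
    int sum = 0;
    for (int rows = 16; rows != 0; rows -= 2) {
        for (int x = 0; x < 16; x++)
            sum += pix[off0 + x] + pix[off1 + x];
        off0 += step;
        off1 += step;
    }
    return sum;
}
```

Calling both with a pointer to the last row of a buffer and a stride of -16 should give the same result as starting at the first row with a stride of 16, which is the kind of case a negative stride test would exercise.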