On Thu, 31 Mar 2022, Ben Avison wrote:
On 30/03/2022 15:14, Martin Storsjö wrote:
On Fri, 25 Mar 2022, Ben Avison wrote:
+// Clamp 16-bit signed block coefficients to signed 8-bit (biased by 128)
+// On entry:
+// x0 -> array of 64x 16-bit coefficients
+// x1 -> 8-bit results
+// x2 = row stride for results, bytes
+function ff_put_signed_pixels_clamped_neon, export=1
+        ld1             {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], #64
+        movi            v4.8b,  #128
+        ld1             {v16.16b, v17.16b, v18.16b, v19.16b}, [x0]
+        sqxtn           v0.8b,  v0.8h
+        sqxtn           v1.8b,  v1.8h
+        sqxtn           v2.8b,  v2.8h
+        sqxtn           v3.8b,  v3.8h
+        sqxtn           v5.8b,  v16.8h
+        add             v0.8b,  v0.8b,  v4.8b
Here you could save 4 add instructions by using sqxtn2 and adding .16b vectors,
but I'm not sure if it's worthwhile. (It reduces the checkasm numbers by 0.7
on Cortex-A72 and by 0.3 on A73, but increases the runtime by 1.0 on A53.)
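For illustration, roughly what I had in mind - just a sketch, with register
choices of my own rather than anything from the patch; the bias then needs all
16 lanes, and the second row in each register would be stored with e.g.
st1 {v0.d}[1], [x1], x2:

        movi            v4.16b, #128                   // bias must cover all 16 lanes now
        sqxtn           v0.8b,  v0.8h                  // row 0 into the low half
        sqxtn2          v0.16b, v1.8h                  // row 1 into the high half
        sqxtn           v1.8b,  v2.8h                  // rows 2-3 likewise
        sqxtn2          v1.16b, v3.8h
        add             v0.16b, v0.16b, v4.16b         // one add now biases two rows
        add             v1.16b, v1.16b, v4.16b

The narrowing instruction count stays the same (a sqxtn/sqxtn2 pair per two
rows); it's only the adds that are halved.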
Strangely enough, I get much smaller numbers on my A72 than you got.
That's weird. As you say, it should be independent of clock frequency. FWIW,
I'm benchmarking on a Raspberry Pi 4; I'd assume all its board variants'
Cortex-A72 cores are of identical revision.
Running it again now, I get these figures:
idctdsp.add_pixels_clamped_c: 313.3
idctdsp.add_pixels_clamped_neon: 24.3
idctdsp.put_pixels_clamped_c: 220.3
idctdsp.put_pixels_clamped_neon: 15.5
idctdsp.put_signed_pixels_clamped_c: 210.5
idctdsp.put_signed_pixels_clamped_neon: 19.5
which is more in line with what you see! I am getting a lot of variability
between runs though - from a small sample, I'm seeing add_pixels_clamped_neon
coming out as anything from 21 to 30, which is well above the sort of
differences you're seeing between alternate implementations.
That's indeed weird. I don't have a Raspberry Pi 4 myself, but for functions
in this size range on the devboards I test on, I get essentially perfectly
stable numbers from run to run - which is great for empirically testing
different implementation strategies.
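(If the variability persists, pinning the process to a single core usually
helps a lot; something like this - assuming checkasm's --bench pattern syntax,
and that core 2 is otherwise idle:

        taskset -c 2 tests/checkasm/checkasm --bench=idctdsp

Setting the cpufreq governor to performance beforehand also removes
DVFS-induced jitter.)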
This sort of case is always going to be difficult to schedule optimally for
multiple cores - factors like how much dual-issuing is possible, the latency
before values can be used, load speed, and the granularity of scoreboarding
parts of vectors all vary widely.
Yup, indeed. In most cases, an implementation that is good for one core is
decent on the others as well, but sometimes it ends up a compromise, where
optimizing for one makes things worse for another. As long as the chosen
implementation isn't very suboptimal on some common cores, it probably
doesn't matter much, though.
// Martin