On Thu, 31 Mar 2022, Ben Avison wrote:
On 30/03/2022 15:14, Martin Storsjö wrote:
On Fri, 25 Mar 2022, Ben Avison wrote:
+// Clamp 16-bit signed block coefficients to signed 8-bit (biased by 128)
+// On entry:
+// x0 -> array of 64x 16-bit coefficients
+// x1 -> 8-bit results
+// x2 = row stride for results, bytes
+function ff_put_signed_pixels_clamped_neon, export=1
+        ld1             {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], #64
+        movi            v4.8b,  #128
+        ld1             {v16.16b, v17.16b, v18.16b, v19.16b}, [x0]
+        sqxtn           v0.8b,  v0.8h
+        sqxtn           v1.8b,  v1.8h
+        sqxtn           v2.8b,  v2.8h
+        sqxtn           v3.8b,  v3.8h
+        sqxtn           v5.8b,  v16.8h
+        add             v0.8b,  v0.8b,  v4.8b
Here you could save 4 add instructions by using sqxtn2 and adding .16b vectors,
but I'm not sure if it's worthwhile. (It reduces the checkasm numbers by 0.7
on Cortex-A72 and by 0.3 on A73, but increases the runtime by 1.0 on A53.)
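For illustration, roughly what I had in mind - just a sketch, with register
choices of my own rather than anything from the patch; the bias then needs all
16 lanes, and the second row in each register would be stored with e.g.
st1 {v0.d}[1], [x1], x2:

        movi            v4.16b, #128                   // bias must cover all 16 lanes now
        sqxtn           v0.8b,  v0.8h                  // row 0 into the low half
        sqxtn2          v0.16b, v1.8h                  // row 1 into the high half
        sqxtn           v1.8b,  v2.8h                  // rows 2-3 likewise
        sqxtn2          v1.16b, v3.8h
        add             v0.16b, v0.16b, v4.16b         // one add now biases two rows
        add             v1.16b, v1.16b, v4.16b

The narrowing instruction count stays the same (a sqxtn/sqxtn2 pair per two
rows); it's only the adds that are halved.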
Strangely enough, I get much smaller numbers on my A72 than you got.
That's weird. As you say, it should be independent of clock frequency. FWIW,
I'm benchmarking on a Raspberry Pi 4; I'd assume all its board variants'
Cortex-A72 cores are of identical revision.
Running it again now, I get these figures:
idctdsp.add_pixels_clamped_c: 313.3
idctdsp.add_pixels_clamped_neon: 24.3
idctdsp.put_pixels_clamped_c: 220.3
idctdsp.put_pixels_clamped_neon: 15.5
idctdsp.put_signed_pixels_clamped_c: 210.5
idctdsp.put_signed_pixels_clamped_neon: 19.5
which is more in line with what you see! I am getting a lot of variability
between runs though - from a small sample, I'm seeing add_pixels_clamped_neon
coming out as anything from 21 to 30, which is well above the sort of
differences you're seeing between alternate implementations.
That's indeed weird. I don't have a Raspberry Pi 4 myself, but for functions
in this size range on the devboards I test on, I get essentially perfectly
stable numbers from run to run - which is great for empirically testing
different implementation strategies.
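(If the variability persists, pinning the process to a single core usually
helps a lot; something like this - assuming checkasm's --bench pattern syntax,
and that core 2 is otherwise idle:

        taskset -c 2 tests/checkasm/checkasm --bench=idctdsp

Setting the cpufreq governor to performance beforehand also removes
DVFS-induced jitter.)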
This sort of case is always going to be difficult to schedule optimally for
multiple cores - factors like how much dual-issuing is possible, the latency
before values can be used, load speed, and the granularity of scoreboarding
parts of vectors all vary widely.
Yup, indeed. In most cases, an implementation that is good for one core is
decent on the others as well, but sometimes it ends up a compromise, where
optimizing for one makes things worse for another. As long as the chosen
implementation isn't very suboptimal on some common cores, it probably
doesn't matter much, though.
// Martin