On Tue, 28 Jan 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:
The key idea is to pass the pre-generated tables to the TBL instruction
and churn through the data 16 bytes at a time. The remaining 4 elements
are handled with a specialized block located at the end of the routine.
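For readers without the patch at hand, the table-driven approach can be sketched in scalar C. This is only an illustration, not the code from the patch: tbl16() and shuffle_bytes_tbl() are hypothetical names, the table contents assume that the digits in a name like 0321 give the source index of each output byte within a 4-byte group, and len is assumed to be a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed table for the 0321 variant: output byte i of a 16-byte block
 * is read from input byte shuf_0321_tbl[i]. */
static const uint8_t shuf_0321_tbl[16] = {
    0, 3, 2, 1,   4, 7, 6, 5,   8, 11, 10, 9,   12, 15, 14, 13
};

/* Scalar stand-in for one TBL lookup on the first n (<= 16) bytes.
 * The temporary buffer keeps it correct when dst and src overlap. */
static void tbl16(uint8_t *dst, const uint8_t *src,
                  const uint8_t *idx, size_t n)
{
    uint8_t tmp[16];
    for (size_t i = 0; i < n; i++)
        tmp[i] = src[idx[i]];
    memcpy(dst, tmp, n);
}

/* Hypothetical driver: full 16-byte blocks first, then the 4, 8 or
 * 12 leftover bytes (len is assumed to be a multiple of 4). */
static void shuffle_bytes_tbl(const uint8_t *src, uint8_t *dst, int len,
                              const uint8_t tbl[16])
{
    int bulk = len & ~15;   /* bytes covered by whole 16-byte blocks */
    int tail = len & 12;    /* the "& 8" and "& 4" remainders combined */

    for (int i = 0; i < bulk; i += 16)
        tbl16(dst + i, src + i, tbl, 16);
    if (tail)
        tbl16(dst + bulk, src + bulk, tbl, (size_t)tail);
}
```

The tail can reuse the same table because every index stays inside its own 4-byte group, so the first 4, 8 or 12 entries never reach past the remaining data.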
The 3210 variant can also be implemented using rev32. Surprisingly, that
version is slower than the generic TBL on A78, yet much faster on A72. I
wrapped it in an #if 0 block.
So the tradeoff is essentially this:
A78:
shuffle_bytes_3210_c: 138.0 ( 1.00x)
shuffle_bytes_3210_tbl_neon: 22.0 ( 6.27x)
shuffle_bytes_3210_rev32_neon: 28.5 ( 4.88x)
A72:
shuffle_bytes_3210_c: 195.8 ( 1.00x)
shuffle_bytes_3210_tbl_neon: 37.8 ( 5.19x)
shuffle_bytes_3210_rev32_neon: 30.8 ( 6.33x)
Yeah it doesn't really make much of a difference which one we pick here;
we're much faster than the C code in any case. I guess favouring tbl for
the newer cores is the right choice to make.
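As a side note on why rev32 is a candidate at all: the 3210 shuffle is just a byte reversal within each 32-bit group. A minimal C sketch of that equivalence, with hypothetical helper names and again assuming the digits name the source byte for each output position:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference 3210 shuffle: output byte i of each 4-byte group is read
 * from input byte 3 - i. */
static void shuffle_3210_ref(const uint8_t *src, uint8_t *dst, size_t len)
{
    for (size_t i = 0; i + 4 <= len; i += 4) {
        dst[i + 0] = src[i + 3];
        dst[i + 1] = src[i + 2];
        dst[i + 2] = src[i + 1];
        dst[i + 3] = src[i + 0];
    }
}

/* Portable 32-bit byte swap. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Same shuffle expressed as a byte swap of each 32-bit word, the
 * scalar analogue of applying rev32 to byte lanes. memcpy avoids
 * unaligned and aliasing problems, and the result does not depend on
 * host endianness. */
static void shuffle_3210_bswap(const uint8_t *src, uint8_t *dst, size_t len)
{
    for (size_t i = 0; i + 4 <= len; i += 4) {
        uint32_t w;
        memcpy(&w, src + i, 4);
        w = bswap32(w);
        memcpy(dst + i, &w, 4);
    }
}
```

Both functions produce identical output for any len that is a multiple of 4, so the choice between tbl and rev32 is purely a question of speed on a given core.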
diff --git a/libswscale/aarch64/rgb2rgb_neon.S b/libswscale/aarch64/rgb2rgb_neon.S
index 1382e00261..a69a211ad4 100644
--- a/libswscale/aarch64/rgb2rgb_neon.S
+++ b/libswscale/aarch64/rgb2rgb_neon.S
@@ -296,3 +359,99 @@ function ff_deinterleave_bytes_neon, export=1
0:
ret
endfunc
+
+.macro neon_shuf shuf
+function ff_shuffle_bytes_\shuf\()_neon, export=1
+ movrel x9, shuf_\shuf\()_tbl
+ ld1 {v1.16b}, [x9]
+ and w5, w2, #~15
+ and w3, w2, #8
+ and w4, w2, #4
+ cbz w5, 2f
+1:
+ subs w5, w5, #16
+ ld1 {v0.16b}, [x0], #16
+ tbl v0.16b, {v0.16b}, v1.16b
+ st1 {v0.16b}, [x1], #16
+ b.gt 1b
Moving the subs to after the ld1 lowers the runtime on the Cortex A53
from this:
shuffle_bytes_0321_c: 283.0 ( 1.00x)
shuffle_bytes_0321_neon: 68.0 ( 4.16x)
to this:
shuffle_bytes_0321_neon: 60.8 ( 4.66x)
So I'm squashing such a change into it.
+#
+#if 0
+function ff_shuffle_bytes_3210_neon, export=1
+ and w5, w2, #~(15)
While it is nice to keep this as reference, it's kinda dead code here, so
I would suggest we just drop it for now. Good that you investigated it
though!
Other than that, this looks really good, so I'll push it with those
changes.
// Martin