On Sat, 31 May 2025, Dmitriy Kovalenko wrote:
I've found quite a few ways to optimize existing ffmpeg's rgb to yuv
subsampled conversion. In this patch stack I'll try to
improve the perofrmance.
This particular set of changes is a small improvement to all the
existing functions and macro. The biggest performance gain is
coming from post loading increment of the pointer and immediate
~~prefetching of the memory blocks~~(was moved to the next patch in the stack)
and interleaving the multiplication shifting operations of
different registers for better scheduling.
Why keep the mention of prefetching here, when it no longer is included in
the patch at all? This is what you suggest is encoded as the final,
immutable commit message describing this change.
I have further inline comments below, please read them all.
Also changed a bunch of places where cmp + b.le was used instead
of one instruction cbnz/tbnz and some other small cleanups.
Here are checkasm results on the macbook pro with the latest M4 max
<before>
bgra_to_uv_1080_c: 257.5 ( 1.00x)
bgra_to_uv_1080_neon: 211.9 ( 1.22x)
bgra_to_uv_1920_c: 467.1 ( 1.00x)
bgra_to_uv_1920_neon: 379.3 ( 1.23x)
bgra_to_uv_half_1080_c: 198.9 ( 1.00x)
bgra_to_uv_half_1080_neon: 125.7 ( 1.58x)
bgra_to_uv_half_1920_c: 346.3 ( 1.00x)
bgra_to_uv_half_1920_neon: 223.7 ( 1.55x)
<after>
bgra_to_uv_1080_c: 268.3 ( 1.00x)
bgra_to_uv_1080_neon: 176.0 ( 1.53x)
bgra_to_uv_1920_c: 456.6 ( 1.00x)
bgra_to_uv_1920_neon: 307.7 ( 1.48x)
bgra_to_uv_half_1080_c: 193.2 ( 1.00x)
bgra_to_uv_half_1080_neon: 96.8 ( 2.00x)
bgra_to_uv_half_1920_c: 347.2 ( 1.00x)
bgra_to_uv_half_1920_neon: 182.6 ( 1.92x)
With my proprietary test on IOS it gives around 70% of performance
improvement converting bgra 1920x1920 image to yuv420p
On my linux arm cortex-r processing the performance improvement not that
visible but still consistently faster by 5-10% than the current
implementation.
---
libswscale/aarch64/input.S | 143 +++++++++++++++++++++++--------------
1 file changed, 91 insertions(+), 52 deletions(-)
@@ -292,7 +330,7 @@ function ff_\fmt_rgb\()ToUV_neon, export=1
smaddl x8, w16, w10, x9 // x8 = ru * r + const_offset
smaddl x8, w17, w11, x8 // x8 += gu * g
smaddl x8, w4, w12, x8 // x8 += bu * b
- asr w8, w8, #9 // x8 >>= 9
+ asr x8, x8, #9 // x8 >>= 9
strh w8, [x0], #2 // store to dst_u
Here you _still_ have one instance of these unrelated changes left in your
patch.
smaddl x8, w16, w13, x9 // x8 = rv * r + const_offset
@@ -401,3 +439,4 @@ endfunc
DISABLE_DOTPROD
#endif
+
--
Here you are adding one unrelated empty line at the end of the file. Don't
include any unrelated changes in your patches.
Before sending a patch, do review it yourself first, checking for any such
unrelated stray changes.
Other than those details, the rest of the patch looks ok.
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".