[FFmpeg-devel] [PATCH v2] avcodec/aarch64/vvc: Optimize NEON version of vvc_dmvr

2025-03-03 Thread Krzysztof Pyrkosz via ffmpeg-devel
This patch replaces blocks of instructions performing rounding and widening shifts with one-liners achieving the same result. Before and after on A78 dmvr_8_12x20_neon: 86.2 ( 6.90x) dmvr_8_20x12_neon: 94.8 ( 5.93x) dmvr_8_2

[FFmpeg-devel] [PATCH v2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}

2025-03-03 Thread Krzysztof Pyrkosz via ffmpeg-devel
This patch replaces integer widening with halving addition, and multi-step "emulated" rounding shift with a single asm instruction doing exactly that. Benchmarks before and after: A78 avg_8_64x64_neon: 2686.2 ( 6.12x) avg_8_128x128_neon:
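
The two substitutions named in this snippet can be modelled in scalar C. This is only a sketch of the arithmetic, not the patch's NEON code; the function names are hypothetical, and it assumes `>>` on negative values is an arithmetic shift (true on common compilers, but implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* Multi-step "emulated" rounding shift: add half the divisor, then shift. */
static int32_t rshift_emulated(int32_t x, unsigned n)
{
    assert(n >= 1 && n <= 16);
    return (x + (1 << (n - 1))) >> n;
}

/* Scalar model of a single rounding-shift instruction:
 * plain shift plus the last bit that was shifted out. */
static int32_t rshift_rounding(int32_t x, unsigned n)
{
    assert(n >= 1 && n <= 16);
    return (x >> n) + ((x >> (n - 1)) & 1);
}

/* Average with widening: promote to 32 bits, add, shift. */
static int16_t halving_add_widened(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a + (int32_t)b) >> 1);
}

/* Model of a halving add: the result never needs more than 16 bits,
 * because each operand is halved before the addition and the lost
 * carry (both low bits set) is added back. */
static int16_t halving_add_narrow(int16_t a, int16_t b)
{
    return (int16_t)((a >> 1) + (b >> 1) + (a & b & 1));
}
```

Both pairs return identical results for 16-bit inputs, which is why a single instruction can stand in for the longer sequence.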

[FFmpeg-devel] [PATCH v2] swscale/aarch64: dotprod implementation of rgba32_to_Y

2025-03-03 Thread Krzysztof Pyrkosz via ffmpeg-devel
The idea is to split the 16 bit coefficients into lower and upper half, invoke udot for the lower half, shift by 8, and follow by udot for the upper half. Benchmark on A78: bgra_to_y_128_c: 682.0 ( 1.00x) bgra_to_y_128_neon:
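
The coefficient split described here rests on a simple identity, sketched below in scalar C. It is an illustration under assumptions, not the patch itself: the function names are hypothetical, the coefficients are assumed to be unsigned 16-bit values, and any rounding constant or final shift used by the real routine is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: multiply by the full 16-bit coefficients, then drop 8 bits. */
static uint32_t dot_wide(const uint8_t *px, const uint16_t *coef, size_t n)
{
    uint32_t acc = 0;
    assert(n <= 256); /* keeps the 32-bit accumulator from wrapping */
    for (size_t i = 0; i < n; i++)
        acc += (uint32_t)px[i] * coef[i];
    return acc >> 8;
}

/* Split version: only 8-bit by 8-bit products, as a udot-style
 * instruction accumulates them. The low-half sum is shifted by 8
 * before the high-half sum is added. */
static uint32_t dot_split(const uint8_t *px, const uint16_t *coef, size_t n)
{
    uint32_t lo = 0, hi = 0;
    assert(n <= 256);
    for (size_t i = 0; i < n; i++) {
        lo += (uint32_t)px[i] * (coef[i] & 0xff); /* lower half */
        hi += (uint32_t)px[i] * (coef[i] >> 8);   /* upper half */
    }
    return (lo >> 8) + hi;
}
```

Since the full sum equals `lo + 256 * hi`, shifting it right by 8 gives exactly `(lo >> 8) + hi`, so no precision is lost by never widening the pixels.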

[FFmpeg-devel] [PATCH] swscale/aarch64/hscale.S Refactor hscale_16_to_15__fs_4

2025-03-01 Thread Krzysztof Pyrkosz via ffmpeg-devel
Before/after: A78 hscale_16_to_15__fs_4_dstW_8_neon: 86.8 ( 1.72x) hscale_16_to_15__fs_4_dstW_24_neon: 147.5 ( 2.73x) hscale_16_to_15__fs_4_dstW_128_neon: 614.0 ( 3.14x) hscale_16_to_15__fs_4_dstW_144_neon: 680.5 ( 3.18x)

[FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_extract_exponents

2025-02-28 Thread Krzysztof Pyrkosz via ffmpeg-devel
Before and after: A78 ac3_extract_exponents_n512_neon: 503.2 ( 3.36x) ac3_extract_exponents_n3072_neon: 2986.2 ( 3.35x) ac3_extract_exponents_n512_neon: 211.2 ( 8.02x) ac3_extract_exponents_n3072_neon: 1251.5 ( 8.

[FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_sum_square_butterfly_int32_neon

2025-02-28 Thread Krzysztof Pyrkosz via ffmpeg-devel
Before and after: A78 ac3_sum_square_bufferfly_int32_neon: 484.8 ( 2.00x) ac3_sum_square_bufferfly_int32_neon: 468.2 ( 2.08x) A72 ac3_sum_square_bufferfly_int32_neon: 793.6 ( 1.26x) ac3_sum_square_bufferfly_int32_neon: 527.3

[FFmpeg-devel] [PATCH] swscale/aarch64: dotprod implementation of rgba32_to_Y

2025-02-27 Thread Krzysztof Pyrkosz via ffmpeg-devel
--- I was curious whether it's possible to implement this function without any widening, and it turns out it not only is, but it's quite performant at the same time! The idea is to split the 16 bit coefficients into lower and upper half, invoke udot for the lower half, shift by 8, and follow by ud

[FFmpeg-devel] [PATCH] avutil/aarch64/tx_float_neon.S: clean up FFT4_X2

2025-02-25 Thread Krzysztof Pyrkosz via ffmpeg-devel
--- This patch rids the code of two tbl instructions and the shuffle table. There's no fneg v0.s[3] instruction unfortunately, so I negate the whole vector and copy the last element only. It's tricky to benchmark this little change but on average it seems to be beneficial. Krzysztof libavutil
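
The "negate everything, copy one lane" workaround can be pictured with a scalar model. This is a sketch only; `vec4f` and `negate_lane3` are hypothetical names standing in for a 128-bit register of four floats and the two-instruction sequence.

```c
/* Hypothetical stand-in for a 128-bit register holding four floats. */
typedef struct { float s[4]; } vec4f;

/* Negate lane 3 only: negate a full copy of the vector (whole-vector
 * negate), then move just the last element back (single-lane copy). */
static vec4f negate_lane3(vec4f v)
{
    vec4f neg;
    for (int i = 0; i < 4; i++)
        neg.s[i] = -v.s[i];
    v.s[3] = neg.s[3];
    return v;
}
```

Lanes 0 to 2 are untouched, so no shuffle table is needed to pick out the one element whose sign changes.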

[FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction

2025-02-20 Thread Krzysztof Pyrkosz via ffmpeg-devel
--- libavcodec/aarch64/vvc/inter.S | 73 ++ 1 file changed, 20 insertions(+), 53 deletions(-) diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S index b65920e640..09f0627b20 100644 --- a/libavcodec/aarch64/vvc/inter.S +++ b/libavcodec/aarc

[FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}

2025-02-20 Thread Krzysztof Pyrkosz via ffmpeg-devel
--- libavcodec/aarch64/vvc/inter.S | 125 - 1 file changed, 122 insertions(+), 3 deletions(-) diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S index 0edc861f97..b65920e640 100644 --- a/libavcodec/aarch64/vvc/inter.S +++ b/libavcodec/aarc

[FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction

2025-02-19 Thread Krzysztof Pyrkosz via ffmpeg-devel
--- Before and after on A78 dmvr_8_12x20_neon: 86.2 ( 6.90x) dmvr_8_20x12_neon: 94.8 ( 5.93x) dmvr_8_20x20_neon: 141.5 ( 6.50x) dmvr_12_12x20_neon: 158.

[FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}

2025-02-19 Thread Krzysztof Pyrkosz via ffmpeg-devel
--- This patch replaces integer widening with halving addition, and multi-step "emulated" rounding shift with a single asm instruction doing exactly that. This pattern repeats in other functions in this file, I fixed some in the succeeding patch. There's a lot of performance to be gained there. I

[FFmpeg-devel] [PATCH] swscale/aarch64/rgb2rgb_neon: Implemented {yuyv, uyvy}toyuv{420, 422}

2025-02-13 Thread Krzysztof Pyrkosz via ffmpeg-devel
This patch successfully passes the GitHub pipeline. The previous one, which adds tests, fails only the first check on Linux x86, probably because of that MMX issue. The tiny patch in the second email chain (the one about right shift by 2) completes the checks chain as-is. Krzysztof --- libswscale

[FFmpeg-devel] [PATCH 2/2] swscale/aarch64/rgb2rgb_neon: Implemented {yuyv, uyvy}toyuv{420, 422}

2025-02-11 Thread Krzysztof Pyrkosz via ffmpeg-devel
--- libswscale/aarch64/rgb2rgb.c | 16 ++ libswscale/aarch64/rgb2rgb_neon.S | 262 ++ 2 files changed, 278 insertions(+) diff --git a/libswscale/aarch64/rgb2rgb.c b/libswscale/aarch64/rgb2rgb.c index 7e1dba572d..f474228298 100644 --- a/libswscale/aarch64/rgb2rgb.

[FFmpeg-devel] [PATCH 1/2] tests/checkasm/sw_rgb: Added {yuyv, uyvy}toyuv{420, 422} test cases

2025-02-11 Thread Krzysztof Pyrkosz via ffmpeg-devel
Splitting the previous patch into two. I noticed that on my x86 box, one of the newly added tests fails: MMXEXT: uyvytoyuv420_mmxext (sw_rgb.c:126) yuyvtoyuv420_mmxext (sw_rgb.c:126) - sw_rgb.uyvytoyuv [FAILED] SSE2, AVX and AVX2 are passing, though. --- tests/checkasm/sw_rgb.c |

[FFmpeg-devel] [PATCH] swscale/aarch64/rgb24toyv12: skip early right shift by 2

2025-02-11 Thread Krzysztof Pyrkosz via ffmpeg-devel
It's a minor improvement that shaves off 5-8% from the execution time. Instead of shifting by 2 right away and by 7 soon after, shift by 9 one time. Times before and after: A78: rgb24toyv12_16_200_neon: 5366.8 ( 3.62x) rgb24toyv12_128_60_neon:
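
A scalar sketch of what merging the two shifts means. The snippet does not show what sits between the shift by 2 and the shift by 7; the code below assumes a multiply by a coefficient, and both function names are hypothetical. It also assumes `>>` on negative values is an arithmetic shift.

```c
#include <stdint.h>

/* Early shift: divide the 4-sample sum by 4 first, multiply, shift by 7. */
static int32_t scale_two_shifts(int32_t sum4, int32_t coef)
{
    return ((sum4 >> 2) * coef) >> 7;
}

/* Merged shift: multiply the raw sum, then shift by 9 once. */
static int32_t scale_one_shift(int32_t sum4, int32_t coef)
{
    return (sum4 * coef) >> 9;
}
```

Under that assumption the merged form saves an instruction and keeps two extra bits through the multiply, so results can differ slightly in the low bits (for example `sum4 = 7`, `coef = 100`). If nothing sits between the two shifts, `(x >> 2) >> 7` and `x >> 9` are identical.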

[FFmpeg-devel] [PATCH] swscale/aarch64/rgb2rgb_neon: Implemented uyvytoyuv422

2025-02-11 Thread Krzysztof Pyrkosz via ffmpeg-devel
I forgot to include the benchmarks in the previous message, here they are: A78: uyvytoyuv420_neon: 6112.5 ( 6.96x) uyvytoyuv422_neon: 6696.0 ( 6.32x) yuyvtoyuv420_neon: 6113.0 ( 6.95x) yuyvtoyu

Re: [FFmpeg-devel] [PATCH] swscale/aarch64/rgb2rgb_neon: Implemented uyvytoyuv422

2025-02-11 Thread Krzysztof Pyrkosz via ffmpeg-devel
On Mon, Feb 10, 2025 at 03:15:35PM +0200, Martin Storsjö wrote: > > Just as I'm about to send this patch, I'm thinking if non-interleaved > > read followed by 4 invocations of TBL wouldn't be more performant. One > > call to generate a contiguous vector of u, second for v and two for y. > > I'm cur

Re: [FFmpeg-devel] [PATCH] avcodec/aarch64/opusdsp_neon: Simplify opus_postfilter_neon

2025-02-08 Thread Krzysztof Pyrkosz via ffmpeg-devel
On Sat, Feb 08, 2025 at 01:59:32AM +0100, Lynne wrote: > On 07/02/2025 20:42, Krzysztof Pyrkosz via ffmpeg-devel wrote: > > This change removes one extra floating point operation and simplifies > > load operations at the beginning of the loop by using dedicated register >

[FFmpeg-devel] [PATCH] avcodec/aarch64/opusdsp_neon: Simplify opus_postfilter_neon

2025-02-07 Thread Krzysztof Pyrkosz via ffmpeg-devel
This change removes one extra floating point operation and simplifies load operations at the beginning of the loop by using a dedicated register for each of the 5 pointers and interleaving it with calculations. The first case seems to be a bit slower, but the performance increase is substantial in th

[FFmpeg-devel] [PATCH] swscale/aarch64/rgb2rgb_neon: Implemented uyvytoyuv422

2025-02-07 Thread Krzysztof Pyrkosz via ffmpeg-devel
The patch contains NEON code that splits the uyvy input array into 3 separate buffers. The existing test cases are covering scenarios with odd height and odd stride, but width is even in every instance. Is it safe to make that assumption about the width? Just as I'm about to send this patch, I'm
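
For reference, the split of a packed UYVY row into three separate buffers looks like this in scalar C. This is a sketch of the data layout only, with a hypothetical function name; it assumes an even width, which is exactly the assumption the message asks about.

```c
#include <assert.h>
#include <stdint.h>

/* Deinterleave one UYVY row (byte order U, Y0, V, Y1 per pixel pair)
 * into separate Y, U and V buffers. */
static void uyvy_row_to_planes(const uint8_t *src, int width,
                               uint8_t *y, uint8_t *u, uint8_t *v)
{
    assert(width % 2 == 0); /* assumption: even width */
    for (int i = 0; i < width / 2; i++) {
        u[i]         = src[4 * i + 0];
        y[2 * i]     = src[4 * i + 1];
        v[i]         = src[4 * i + 2];
        y[2 * i + 1] = src[4 * i + 3];
    }
}
```

An odd width would leave a final luma sample with only half of a U/V pair, which is why the question matters for the vectorized version.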

[FFmpeg-devel] [PATCH] swscale/aarch64/output.S: refactor ff_yuv2plane1_8_neon

2025-01-31 Thread Krzysztof Pyrkosz via ffmpeg-devel
The benchmarks (before vs after) were gathered using ./tests/checkasm/checkasm --test=sw_scale --bench --runs=6 | grep yuv2yuv1 A78 before: yuv2yuv1_0_512_accurate_c: 2039.5 ( 1.00x) yuv2yuv1_0_512_accurate_neon: 385.5 ( 5.29x) yuv2yuv1_0_512_ap

[FFmpeg-devel] [PATCH] swscale/aarch64/rgb2rgb: Implemented NEON shuf routines

2025-01-28 Thread Krzysztof Pyrkosz via ffmpeg-devel
The key idea is to pass the pre-generated tables to the TBL instruction and churn through the data 16 bytes at a time. The remaining 4 elements are handled with a specialized block located at the end of the routine. The 3210 variant can be implemented using rev32, but surprisingly it is slower tha
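
A scalar model of the table-driven shuffle, as a sketch under assumptions: `tbl16`, `shuf_3210` and `shuffle_bytes` are hypothetical names, the size is assumed to be a multiple of 4 bytes, and the tail here simply works one 4-byte pixel at a time rather than mirroring the patch's specialized block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 16-byte table lookup: out[i] = in[idx[i]]. */
static void tbl16(uint8_t *out, const uint8_t *in, const uint8_t idx[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = in[idx[i]];
    for (int i = 0; i < 16; i++)
        out[i] = tmp[i];
}

/* Pre-generated table for the 3210 variant: reverse the bytes
 * inside each 4-byte pixel, four pixels per lookup. */
static const uint8_t shuf_3210[16] = {
    3, 2, 1, 0,  7, 6, 5, 4,  11, 10, 9, 8,  15, 14, 13, 12
};

static void shuffle_bytes(const uint8_t *src, uint8_t *dst, size_t size,
                          const uint8_t table[16])
{
    size_t i = 0;
    assert(size % 4 == 0);
    for (; i + 16 <= size; i += 16)   /* main loop: 16 bytes per step */
        tbl16(dst + i, src + i, table);
    for (; i < size; i += 4)          /* tail: one pixel per step */
        for (int k = 0; k < 4; k++)
            dst[i + k] = src[i + table[k]];
}
```

Every shuffle variant then shares the same loop and differs only in the 16-byte table passed in.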

Re: [FFmpeg-devel] [PATCH] avcodec/aarch64/aacencdsp: NEON implementation

2025-01-27 Thread Krzysztof Pyrkosz via ffmpeg-devel
On Sun, Jan 26, 2025 at 01:29:38AM +0200, Martin Storsjö wrote: > With the following diff: > > @@ -40,8 +41,8 @@ function ff_aac_quant_bands_neon, export=1 > movi v5.4s, 0x80, lsl #24 > .irp signed,1,0 > \signed: > - subs w3, w3, #4 > ld1

[FFmpeg-devel] [PATCH] avcodec/aarch64/aacencdsp: NEON implementation

2025-01-24 Thread Krzysztof Pyrkosz via ffmpeg-devel
This patch supplies handwritten NEON code for AAC. The benchmarks below were collected by invoking these two commands on each of my boards, A78, A72 and Thinkpad x13s: 1) ./tests/checkasm/checkasm --test=aacencdsp --bench --runs=12 2) ./ffmpeg -y -t 10:00 -f lavfi -i sine /tmp/foo.aac (the first l

Re: [FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmul_add_neon

2025-01-23 Thread Krzysztof Pyrkosz via ffmpeg-devel
On Sun, Jan 19, 2025 at 10:57:57PM +0200, Martin Storsjö wrote: > On Sun, 19 Jan 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote: > > > Removed a branch, unrolled loop. Speed increase bumped from 3.95 to 5.60. > > On what core is that? Please quote the actual output incl

[FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_butterflies_float_neon

2025-01-19 Thread Krzysztof Pyrkosz via ffmpeg-devel
Modified the main loop to handle 8 floats in one go. A special case of length not being a multiple of 8 is handled at the beginning. The speed increased from 3.90 to 4.50. Krzysztof --- libavutil/aarch64/float_dsp_neon.S | 30 ++ 1 file changed, 22 insertions(+), 8 dele
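
The loop structure described here, remainder first and then blocks of 8, can be sketched in scalar C. The function name is hypothetical and the remainder handling is a guess at the shape, not a transcription of the assembly.

```c
#include <assert.h>

/* Butterfly: v1[i] becomes the sum, v2[i] becomes the difference. */
static void butterflies_float(float *v1, float *v2, int len)
{
    int i = 0;
    int head = len % 8;
    assert(len >= 0);

    /* Special case up front: elements that do not fill a block of 8. */
    for (; i < head; i++) {
        float t = v1[i] - v2[i];
        v1[i] += v2[i];
        v2[i] = t;
    }

    /* Main loop: 8 floats per iteration. */
    for (; i < len; i += 8) {
        for (int k = 0; k < 8; k++) {
            float t = v1[i + k] - v2[i + k];
            v1[i + k] += v2[i + k];
            v2[i + k] = t;
        }
    }
}
```

Dealing with the remainder before the main loop keeps the hot loop free of length checks other than the loop counter.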

[FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmul_add_neon

2025-01-19 Thread Krzysztof Pyrkosz via ffmpeg-devel
Removed a branch, unrolled loop. Speed increase bumped from 3.95 to 5.60. Krzysztof --- libavutil/aarch64/float_dsp_neon.S | 28 +++- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S i

[FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmac_scalar_neon

2025-01-19 Thread Krzysztof Pyrkosz via ffmpeg-devel
--- libavutil/aarch64/float_dsp_neon.S | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S index 35e2715b87..b21f34c084 100644 --- a/libavutil/aarch64/float_dsp_neon.S +++ b/libavutil/aarch64/flo

[FFmpeg-devel] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmac_scalar_neon

2025-01-19 Thread Krzysztof Pyrkosz via ffmpeg-devel
Removed two redundant pointer arithmetic operations and split the load section into two smaller ones. Speedup compared to C increased from 4.50 to 5.80.