This patch replaces blocks of instructions performing rounding and
widening shifts with one-liners achieving the same result.
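Roughly the kind of rewrite this means (illustrative operands and shift amounts, not the actual patch code):

    // emulated widening shift, two instructions:
    uxtl    v1.8h, v0.8b            // widen u8 -> u16
    shl     v1.8h, v1.8h, #6        // then shift left
    // one-liner:
    ushll   v1.8h, v0.8b, #6        // widen and shift in one go

    // emulated rounding shift, bias plus shift:
    add     v0.8h, v0.8h, v30.8h    // v30 holds the rounding bias
    ushr    v0.8h, v0.8h, #4        // truncating shift
    // one-liner:
    urshr   v0.8h, v0.8h, #4        // rounding shift right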
Before and after on A78
dmvr_8_12x20_neon: 86.2 ( 6.90x)
dmvr_8_20x12_neon: 94.8 ( 5.93x)
dmvr_8_2
This patch replaces integer widening with halving addition, and
multi-step "emulated" rounding shift with a single asm instruction doing
exactly that.
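A sketch of the transformation (element types and shift counts are made up, not the actual patch code). The halving add keeps the sum in 16 bits, and since it only drops the sum's lowest bit, a rounding shift by one bit less yields the same value as rounding the full-width sum:

    // before: widen, add, rounding shift, narrow
    saddl   v2.4s, v0.4h, v1.4h     // 16-bit + 16-bit -> 32-bit
    saddl2  v3.4s, v0.8h, v1.8h
    rshrn   v4.4h, v2.4s, #7        // rounding shift + narrow
    rshrn2  v4.8h, v3.4s, #7
    // after: stay in 16 bits the whole time
    shadd   v0.8h, v0.8h, v1.8h     // (a + b) >> 1, no widening
    srshr   v0.8h, v0.8h, #6        // one less bit of rounding shift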
Benchmarks before and after:
A78
avg_8_64x64_neon: 2686.2 ( 6.12x)
avg_8_128x128_neon:
The idea is to split the 16-bit coefficients into lower and upper halves,
invoke udot for the lower half, shift by 8, and follow with udot for the
upper half.
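In instruction form, the trick might look like this (hypothetical register assignment, with the coefficients pre-split into low bytes in v16 and high bytes in v17). Since px*coef = 256*(px*hi) + px*lo, shifting the accumulator between the two udots glues the halves together:

    movi    v2.4s, #0
    udot    v2.4s, v0.16b, v17.16b  // dot product with the high bytes
    shl     v2.4s, v2.4s, #8        // scale the partial sums by 256
    udot    v2.4s, v0.16b, v16.16b  // accumulate the low-byte products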
Benchmark on A78:
bgra_to_y_128_c: 682.0 ( 1.00x)
bgra_to_y_128_neon:
Before/after:
A78
hscale_16_to_15__fs_4_dstW_8_neon:    86.8 ( 1.72x)
hscale_16_to_15__fs_4_dstW_24_neon:  147.5 ( 2.73x)
hscale_16_to_15__fs_4_dstW_128_neon: 614.0 ( 3.14x)
hscale_16_to_15__fs_4_dstW_144_neon: 680.5 ( 3.18x)
Before and after:
A78
ac3_extract_exponents_n512_neon: 503.2 ( 3.36x)
ac3_extract_exponents_n3072_neon: 2986.2 ( 3.35x)
ac3_extract_exponents_n512_neon: 211.2 ( 8.02x)
ac3_extract_exponents_n3072_neon: 1251.5 ( 8.
Before and after:
A78
ac3_sum_square_butterfly_int32_neon: 484.8 ( 2.00x)
ac3_sum_square_butterfly_int32_neon: 468.2 ( 2.08x)
A72
ac3_sum_square_butterfly_int32_neon: 793.6 ( 1.26x)
ac3_sum_square_butterfly_int32_neon: 527.3
---
I was curious whether it's possible to implement this function without
any widening, and it turns out it not only is possible, but it's quite
performant at the same time!
The idea is to split the 16-bit coefficients into lower and upper halves,
invoke udot for the lower half, shift by 8, and follow with udot for the
upper half.
---
This patch rids the code of two tbl instructions and the shuffle
table. There's no fneg v0.s[3] instruction unfortunately, so I negate
the whole vector and copy back only the last element.
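Something like this (sketch):

    fneg    v1.4s, v0.4s        // negate all four lanes into a scratch
    mov     v0.s[3], v1.s[3]    // keep the negation only in the last lane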
It's tricky to benchmark this little change but on average it seems to
be beneficial.
Krzysztof
libavutil
---
libavcodec/aarch64/vvc/inter.S | 73 ++
1 file changed, 20 insertions(+), 53 deletions(-)
diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index b65920e640..09f0627b20 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarc
---
libavcodec/aarch64/vvc/inter.S | 125 -
1 file changed, 122 insertions(+), 3 deletions(-)
diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 0edc861f97..b65920e640 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarc
---
Before and after on A78
dmvr_8_12x20_neon: 86.2 ( 6.90x)
dmvr_8_20x12_neon: 94.8 ( 5.93x)
dmvr_8_20x20_neon: 141.5 ( 6.50x)
dmvr_12_12x20_neon: 158.
---
This patch replaces integer widening with halving addition, and
multi-step "emulated" rounding shift with a single asm instruction doing
exactly that. This pattern repeats in other functions in this file; I
fixed some in the succeeding patch. There's a lot of performance to be
gained there.
I
This patch successfully passes the github pipeline. The previous one,
which adds tests, fails only the first check on linux x86, probably
because of that MMX issue.
The tiny patch in the second email chain (the one about right shift by
2) completes the checks chain as-is.
Krzysztof
---
libswscale
---
libswscale/aarch64/rgb2rgb.c | 16 ++
libswscale/aarch64/rgb2rgb_neon.S | 262 ++
2 files changed, 278 insertions(+)
diff --git a/libswscale/aarch64/rgb2rgb.c b/libswscale/aarch64/rgb2rgb.c
index 7e1dba572d..f474228298 100644
--- a/libswscale/aarch64/rgb2rgb.
Splitting the previous patch into two.
I noticed that on my x86 box, one of the newly added tests fails:
MMXEXT:
uyvytoyuv420_mmxext (sw_rgb.c:126)
yuyvtoyuv420_mmxext (sw_rgb.c:126)
- sw_rgb.uyvytoyuv [FAILED]
SSE2, AVX and AVX2 are passing, though.
---
tests/checkasm/sw_rgb.c |
It's a minor improvement that shaves 5-8% off the execution time.
Instead of shifting by 2 right away and by 7 soon after, shift by 9
once.
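That is, something along these lines (illustrative instructions, not the actual code); for truncating shifts (x >> 2) >> 7 == x >> 9, so the two can be merged:

    // before:
    sshr    v0.8h, v0.8h, #2        // first shift
    // ... intervening work ...
    sshr    v0.8h, v0.8h, #7        // second shift
    // after:
    sshr    v0.8h, v0.8h, #9        // one combined shift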
Times before and after:
A78:
rgb24toyv12_16_200_neon: 5366.8 ( 3.62x)
rgb24toyv12_128_60_neon:
I forgot to include the benchmarks in the previous message, here they
are:
A78:
uyvytoyuv420_neon: 6112.5 ( 6.96x)
uyvytoyuv422_neon: 6696.0 ( 6.32x)
yuyvtoyuv420_neon: 6113.0 ( 6.95x)
yuyvtoyu
On Mon, Feb 10, 2025 at 03:15:35PM +0200, Martin Storsjö wrote:
> > Just as I'm about to send this patch, I'm thinking if non-interleaved
> > read followed by 4 invocations of TBL wouldn't be more performant. One
> > call to generate a contiguous vector of u, second for v and two for y.
> > I'm cur
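For readers following along, the alternative floated in the quote might look like this (hypothetical registers, with the index tables pre-loaded into v28-v31): 64 bytes of UYVY hold 32 Y, 16 U and 16 V samples, so four tbl lookups produce one full vector of each:

    ld1     {v0.16b-v3.16b}, [x0], #64          // non-interleaved read
    tbl     v4.16b, {v0.16b-v3.16b}, v28.16b    // all 16 U samples
    tbl     v5.16b, {v0.16b-v3.16b}, v29.16b    // all 16 V samples
    tbl     v6.16b, {v0.16b-v3.16b}, v30.16b    // first 16 Y samples
    tbl     v7.16b, {v0.16b-v3.16b}, v31.16b    // second 16 Y samples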
On Sat, Feb 08, 2025 at 01:59:32AM +0100, Lynne wrote:
> On 07/02/2025 20:42, Krzysztof Pyrkosz via ffmpeg-devel wrote:
> > This change removes one extra floating point operation and simplifies
> > load operations at the beginning of the loop by using dedicated register
> &
This change removes one extra floating point operation and simplifies
load operations at the beginning of the loop by using a dedicated
register for each of the 5 pointers and interleaving them with
calculations. The first case seems to be a bit slower, but the
performance increase is
substantial in th
The patch contains NEON code that splits the uyvy input array into 3
separate buffers.
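One plausible shape for that split (hypothetical registers, untested sketch): ld4 deinterleaves the U, even-Y, V and odd-Y bytes directly, and the Y halves are zipped back into order:

    ld4     {v0.16b-v3.16b}, [x3], #64   // v0=U, v1=even Y, v2=V, v3=odd Y
    zip1    v4.16b, v1.16b, v3.16b       // restore Y sample order
    zip2    v5.16b, v1.16b, v3.16b
    st1     {v0.16b}, [x1], #16          // U plane
    st1     {v2.16b}, [x2], #16          // V plane
    st1     {v4.16b, v5.16b}, [x0], #32  // Y plane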
The existing test cases are covering scenarios with odd height and odd
stride, but width is even in every instance. Is it safe to make that
assumption about the width?
Just as I'm about to send this patch, I'm
The benchmarks (before vs after) were gathered using
./tests/checkasm/checkasm --test=sw_scale --bench --runs=6 | grep yuv2yuv1
A78 before:
yuv2yuv1_0_512_accurate_c:    2039.5 ( 1.00x)
yuv2yuv1_0_512_accurate_neon:  385.5 ( 5.29x)
yuv2yuv1_0_512_ap
The key idea is to pass the pre-generated tables to the TBL instruction
and churn through the data 16 bytes at a time. The remaining 4 elements
are handled with a specialized block located at the end of the routine.
The 3210 variant can be implemented using rev32, but surprisingly it is
slower than the tbl version.
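The main loop of such a tbl-driven shuffle might look like this (hypothetical registers and labels, not the actual patch):

    ld1     {v31.16b}, [x2]              // x2 -> precomputed index table
    1:
    ld1     {v0.16b}, [x0], #16          // 4 RGBA pixels in
    tbl     v0.16b, {v0.16b}, v31.16b    // permute bytes per the table
    st1     {v0.16b}, [x1], #16          // 4 shuffled pixels out
    subs    w4, w4, #4
    b.gt    1b
    // for 3210 specifically, rev32 v0.16b, v0.16b reverses the bytes
    // within each 32-bit lane and needs no table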
On Sun, Jan 26, 2025 at 01:29:38AM +0200, Martin Storsjö wrote:
> With the following diff:
>
> @@ -40,8 +41,8 @@ function ff_aac_quant_bands_neon, export=1
> movi    v5.4s, 0x80, lsl #24
> .irp signed,1,0
> \signed:
> -subs    w3, w3, #4
> ld1
This patch supplies handwritten NEON code for AAC.
The benchmarks below were collected by invoking these two commands on
each of my boards, A78, A72 and Thinkpad x13s:
1) ./tests/checkasm/checkasm --test=aacencdsp --bench --runs=12
2) ./ffmpeg -y -t 10:00 -f lavfi -i sine /tmp/foo.aac (the first l
On Sun, Jan 19, 2025 at 10:57:57PM +0200, Martin Storsjö wrote:
> On Sun, 19 Jan 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:
>
> > Removed a branch, unrolled loop. Speed increase bumped from 3.95 to 5.60.
>
> On what core is that? Please quote the actual output incl
Modified the main loop to handle 8 floats in one go. A special case of
the length not being a multiple of 8 is handled at the beginning. The
speed increased from 3.90 to 4.50.
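The shape of the main loop, roughly (hypothetical operands; the remainder handling at the top is elided):

    1:
    ld1     {v0.4s, v1.4s}, [x1], #32    // 8 floats from one input
    ld1     {v2.4s, v3.4s}, [x2], #32    // 8 floats from the other
    fmul    v0.4s, v0.4s, v2.4s
    fmul    v1.4s, v1.4s, v3.4s
    st1     {v0.4s, v1.4s}, [x0], #32
    subs    w3, w3, #8                   // 8 elements per iteration
    b.gt    1b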
Krzysztof
---
libavutil/aarch64/float_dsp_neon.S | 30 ++
1 file changed, 22 insertions(+), 8 dele
Removed a branch, unrolled loop. Speed increase bumped from 3.95 to 5.60.
Krzysztof
---
libavutil/aarch64/float_dsp_neon.S | 28 +++-
1 file changed, 15 insertions(+), 13 deletions(-)
diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S
i
---
libavutil/aarch64/float_dsp_neon.S | 13 ++---
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S
index 35e2715b87..b21f34c084 100644
--- a/libavutil/aarch64/float_dsp_neon.S
+++ b/libavutil/aarch64/flo
Removed two redundant pointer arithmetic operations and split load
section into two smaller ones.
Speedup compared to C increased from 4.50 to 5.80.
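Presumably along these lines (illustrative only): post-indexed addressing folds the pointer adds into the loads, and two smaller load groups can be interleaved with the arithmetic:

    // before: one big load block plus explicit pointer bumps
    ld1     {v0.4s-v3.4s}, [x1]
    add     x1, x1, #64
    // after: post-index removes the adds, and the second load can
    // overlap with work on the first half
    ld1     {v0.4s, v1.4s}, [x1], #32
    fmul    v0.4s, v0.4s, v4.4s
    ld1     {v2.4s, v3.4s}, [x1], #32
    fmul    v1.4s, v1.4s, v5.4s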