[FFmpeg-devel] [PATCH v3 10/10] avcodec/vc1: Arm 32-bit NEON unescape fast path

2022-03-31 Thread Ben Avison
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. vc1dsp.vc1_unescape_buffer_c: 918624.7 vc1dsp.vc1_unescape_buffer_neon: 142958.0 Signed-off-by: Ben Avison --- libavcodec/arm/vc1dsp_init_neon.c | 61 +++ libavcodec/arm/vc1dsp_neon.S | 118

[FFmpeg-devel] [PATCH v3 09/10] avcodec/vc1: Arm 64-bit NEON unescape fast path

2022-03-31 Thread Ben Avison
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. vc1dsp.vc1_unescape_buffer_c: 655617.7 vc1dsp.vc1_unescape_buffer_neon: 118237.0 Signed-off-by: Ben Avison --- libavcodec/aarch64/vc1dsp_init_aarch64.c | 61 libavcodec/aarch64/vc1dsp_neon.S | 176

[FFmpeg-devel] [PATCH v3 08/10] avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths

2022-03-31 Thread Ben Avison
Signed-off-by: Ben Avison --- libavcodec/aarch64/Makefile | 3 +- libavcodec/aarch64/idctdsp_init_aarch64.c | 26 +++-- libavcodec/aarch64/idctdsp_neon.S | 130 ++ 3 files changed, 150 insertions(+), 9 deletions(-) create mode 100644 libavcodec/aarch64

[FFmpeg-devel] [PATCH v3 07/10] avcodec/vc1: Arm 64-bit NEON inverse transform fast paths

2022-03-31 Thread Ben Avison
: 268.2 vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5 Signed-off-by: Ben Avison --- libavcodec/aarch64/vc1dsp_init_aarch64.c | 19 + libavcodec/aarch64/vc1dsp_neon.S | 678 +++ 2 files changed, 697 insertions(+) diff --git a/libavcodec/aarch64/vc1dsp_init_aarch64.c b

[FFmpeg-devel] [PATCH v3 04/10] avcodec/vc1: Introduce fast path for unescaping bitstream buffer

2022-03-31 Thread Ben Avison
Includes a checkasm test. Signed-off-by: Ben Avison --- libavcodec/vc1dec.c | 20 ++-- libavcodec/vc1dsp.c | 2 ++ libavcodec/vc1dsp.h | 3 ++ tests/checkasm/vc1dsp.c | 67 + 4 files changed, 82 insertions(+), 10 deletions(-) diff

[FFmpeg-devel] [PATCH v3 06/10] avcodec/vc1: Arm 32-bit NEON deblocking filter fast paths

2022-03-31 Thread Ben Avison
vc1dsp.vc1_v_loop_filter16_bestcase_neon: 103.7 vc1dsp.vc1_v_loop_filter16_worstcase_c: 646.5 vc1dsp.vc1_v_loop_filter16_worstcase_neon: 110.7 Signed-off-by: Ben Avison --- libavcodec/arm/vc1dsp_init_neon.c | 14 + libavcodec/arm/vc1dsp_neon.S | 643 ++ 2 files

[FFmpeg-devel] [PATCH v3 03/10] checkasm: Add idctdsp add/put-pixels-clamped tests

2022-03-31 Thread Ben Avison
Signed-off-by: Ben Avison --- tests/checkasm/Makefile | 1 + tests/checkasm/checkasm.c | 3 ++ tests/checkasm/checkasm.h | 1 + tests/checkasm/idctdsp.c | 98 +++ tests/fate/checkasm.mak | 1 + 5 files changed, 104 insertions(+) create mode 100644

[FFmpeg-devel] [PATCH v3 05/10] avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths

2022-03-31 Thread Ben Avison
vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0 vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2 vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2 Signed-off-by: Ben Avison --- libavcodec/aarch64/Makefile | 1 + libavcodec/aarch64/vc1dsp_init_aarch64.c | 14 + libavcodec/aarch64/vc1dsp_neon.S

[FFmpeg-devel] [PATCH v3 02/10] checkasm: Add vc1dsp inverse transform tests

2022-03-31 Thread Ben Avison
at both the existing AArch32 decoder and my new AArch64 decoder both pass. Signed-off-by: Ben Avison --- tests/checkasm/vc1dsp.c | 283 1 file changed, 283 insertions(+) diff --git a/tests/checkasm/vc1dsp.c b/tests/checkasm/vc1dsp.c index 2fd6c74d6c..7d44

[FFmpeg-devel] [PATCH v3 01/10] checkasm: Add vc1dsp in-loop deblocking filter tests

2022-03-31 Thread Ben Avison
these two extremes. Signed-off-by: Ben Avison --- tests/checkasm/Makefile | 1 + tests/checkasm/checkasm.c | 3 ++ tests/checkasm/checkasm.h | 1 + tests/checkasm/vc1dsp.c | 102 ++ tests/fate/checkasm.mak | 1 + 5 files changed, 108 insertions

[FFmpeg-devel] [PATCH v3 00/10] avcodec/vc1: Arm optimisations

2022-03-31 Thread Ben Avison
with tighter alignment than is encountered in normal use. * Correct unescape buffer memcmp length. * Update benchmarks for AArch64 idctdsp. Ben Avison (10): checkasm: Add vc1dsp in-loop deblocking filter tests checkasm: Add vc1dsp inverse transform tests checkasm: Add idctdsp add/put-p

Re: [FFmpeg-devel] [PATCH 08/10] avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths

2022-03-31 Thread Ben Avison
On 30/03/2022 15:14, Martin Storsjö wrote: On Fri, 25 Mar 2022, Ben Avison wrote: +// Clamp 16-bit signed block coefficients to signed 8-bit (biased by 128) +// On entry: +//   x0 -> array of 64x 16-bit coefficients +//   x1 -> 8-bit results +//   x2 = row stride for results, bytes +fu
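The quoted comment describes clamping 16-bit signed coefficients to signed 8-bit and biasing the result by 128. A minimal scalar sketch of that operation, assuming an 8x8 block stored row-major (the "64x 16-bit coefficients" in the comment); the function name `put_signed_clamped_sketch` is hypothetical and this is not the NEON code from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference, not the patch's implementation.
 * block  -> 64 signed 16-bit coefficients, assumed row-major 8x8
 * pixels -> 8-bit results
 * stride =  row stride for results, in bytes */
static void put_signed_clamped_sketch(const int16_t *block, uint8_t *pixels,
                                      ptrdiff_t stride)
{
    assert(stride >= 8); /* assumption: output rows do not overlap */
    for (int row = 0; row < 8; row++) {
        for (int col = 0; col < 8; col++) {
            int v = block[row * 8 + col];
            /* Clamp to the signed 8-bit range... */
            if (v < -128) v = -128;
            if (v >  127) v =  127;
            /* ...then bias by 128 so the result fits an unsigned byte. */
            pixels[col] = (uint8_t)(v + 128);
        }
        pixels += stride;
    }
}
```

With this sketch, a coefficient of 0 maps to 128, and anything at or below -128 or at or above 127 saturates to 0 or 255 respectively.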

Re: [FFmpeg-devel] [PATCH 07/10] avcodec/vc1: Arm 64-bit NEON inverse transform fast paths

2022-03-31 Thread Ben Avison
On 30/03/2022 14:49, Martin Storsjö wrote: Looks generally reasonable. Is it possible to factorize out the individual transforms (so that you'd e.g. invoke the same macro twice in the 8x8 and 4x4 functions) without too much loss? There is a close analogy here with the vertical/horizontal deblo

Re: [FFmpeg-devel] [PATCH 05/10] avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths

2022-03-31 Thread Ben Avison
On 30/03/2022 13:35, Martin Storsjö wrote: Overall, the code looks sensible to me. Would it make sense to share the core of the filter between the horizontal/vertical cases with e.g. a macro? (I didn't check in detail if there's much differences in the core of the filter. At most some differe

Re: [FFmpeg-devel] [PATCH 04/10] avcodec/vc1: Introduce fast path for unescaping bitstream buffer

2022-03-31 Thread Ben Avison
On 29/03/2022 21:37, Martin Storsjö wrote: On Fri, 25 Mar 2022, Ben Avison wrote: +#define TEST_UNESCAPE \ +    do

Re: [FFmpeg-devel] [PATCH 03/10] checkasm: Add idctdsp add/put-pixels-clamped tests

2022-03-29 Thread Ben Avison
On 29/03/2022 14:13, Martin Storsjö wrote: On Fri, 25 Mar 2022, Ben Avison wrote: Disable ff_add_pixels_clamped_arm, which was found to fail the test. I had a look at this function, and I see that the overflow checks are using     tst r6,  #0x100 to see whether the

Re: [FFmpeg-devel] [PATCH] vc1dsp: Change remaining stride parameters to ptrdiff_t

2022-03-29 Thread Ben Avison
On 29/03/2022 13:44, Martin Storsjö wrote: The existing x86 assembly for loop filters uses the stride as a full register without clearing/sign extending the upper half of the registers on x86_64. This avoids crashes if the caller would have passed nonzero bits in the previously undefined upper 3

Re: [FFmpeg-devel] [PATCH 01/10] checkasm: Add vc1dsp in-loop deblocking filter tests

2022-03-28 Thread Ben Avison
On 25/03/2022 22:53, Martin Storsjö wrote: On Fri, 25 Mar 2022, Ben Avison wrote: +#define CHECK_LOOP_FILTER(func) \ +    do {    \ +    if (check_func(h.func, "vc1dsp."

[FFmpeg-devel] [PATCH 10/10] avcodec/vc1: Arm 32-bit NEON unescape fast path

2022-03-25 Thread Ben Avison
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. vc1dsp.vc1_unescape_buffer_c: 918624.7 vc1dsp.vc1_unescape_buffer_neon: 142958.0 Signed-off-by: Ben Avison --- libavcodec/arm/vc1dsp_init_neon.c | 61 +++ libavcodec/arm/vc1dsp_neon.S | 118

[FFmpeg-devel] [PATCH 07/10] avcodec/vc1: Arm 64-bit NEON inverse transform fast paths

2022-03-25 Thread Ben Avison
: 268.2 vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5 Signed-off-by: Ben Avison --- libavcodec/aarch64/vc1dsp_init_aarch64.c | 19 + libavcodec/aarch64/vc1dsp_neon.S | 678 +++ 2 files changed, 697 insertions(+) diff --git a/libavcodec/aarch64/vc1dsp_init_aarch64.c b

[FFmpeg-devel] [PATCH 09/10] avcodec/vc1: Arm 64-bit NEON unescape fast path

2022-03-25 Thread Ben Avison
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. vc1dsp.vc1_unescape_buffer_c: 655617.7 vc1dsp.vc1_unescape_buffer_neon: 118237.0 Signed-off-by: Ben Avison --- libavcodec/aarch64/vc1dsp_init_aarch64.c | 61 libavcodec/aarch64/vc1dsp_neon.S | 176

[FFmpeg-devel] [PATCH 06/10] avcodec/vc1: Arm 32-bit NEON deblocking filter fast paths

2022-03-25 Thread Ben Avison
vc1dsp.vc1_v_loop_filter16_bestcase_neon: 103.7 vc1dsp.vc1_v_loop_filter16_worstcase_c: 646.5 vc1dsp.vc1_v_loop_filter16_worstcase_neon: 110.7 Signed-off-by: Ben Avison --- libavcodec/arm/vc1dsp_init_neon.c | 14 + libavcodec/arm/vc1dsp_neon.S | 643 ++ 2 files

[FFmpeg-devel] [PATCH 08/10] avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths

2022-03-25 Thread Ben Avison
Signed-off-by: Ben Avison --- libavcodec/aarch64/Makefile | 3 +- libavcodec/aarch64/idctdsp_init_aarch64.c | 26 +++-- libavcodec/aarch64/idctdsp_neon.S | 130 ++ 3 files changed, 150 insertions(+), 9 deletions(-) create mode 100644 libavcodec/aarch64

[FFmpeg-devel] [PATCH 05/10] avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths

2022-03-25 Thread Ben Avison
vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0 vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2 vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2 Signed-off-by: Ben Avison --- libavcodec/aarch64/Makefile | 1 + libavcodec/aarch64/vc1dsp_init_aarch64.c | 14 + libavcodec/aarch64/vc1dsp_neon.S

[FFmpeg-devel] [PATCH 04/10] avcodec/vc1: Introduce fast path for unescaping bitstream buffer

2022-03-25 Thread Ben Avison
Includes a checkasm test. Signed-off-by: Ben Avison --- libavcodec/vc1dec.c | 20 +++--- libavcodec/vc1dsp.c | 2 ++ libavcodec/vc1dsp.h | 3 +++ tests/checkasm/vc1dsp.c | 59 + 4 files changed, 74 insertions(+), 10 deletions

[FFmpeg-devel] [PATCH 03/10] checkasm: Add idctdsp add/put-pixels-clamped tests

2022-03-25 Thread Ben Avison
Disable ff_add_pixels_clamped_arm, which was found to fail the test. As this is normally only used for Arms prior to Armv6 (ARM11) it seems quite unlikely that anyone is still using this, so I haven't put in the effort to debug it. Signed-off-by: Ben Avison --- libavcodec/arm/idctdsp_init_

[FFmpeg-devel] [PATCH 02/10] checkasm: Add vc1dsp inverse transform tests

2022-03-25 Thread Ben Avison
at both the existing AArch32 decoder and my new AArch64 decoder both pass. Signed-off-by: Ben Avison --- tests/checkasm/vc1dsp.c | 258 1 file changed, 258 insertions(+) diff --git a/tests/checkasm/vc1dsp.c b/tests/checkasm/vc1dsp.c index db916d08f9..0823

[FFmpeg-devel] [PATCH 01/10] checkasm: Add vc1dsp in-loop deblocking filter tests

2022-03-25 Thread Ben Avison
these two extremes. Signed-off-by: Ben Avison --- tests/checkasm/Makefile | 1 + tests/checkasm/checkasm.c | 3 ++ tests/checkasm/checkasm.h | 1 + tests/checkasm/vc1dsp.c | 94 +++ tests/fate/checkasm.mak | 1 + 5 files changed, 100 insertions

[FFmpeg-devel] [PATCH v2 00/10] avcodec/vc1: Arm optimisations

2022-03-25 Thread Ben Avison
rch64 blockdsp fast paths since it was impossible to demonstrate that they had any appreciable effect on timings. Ben Avison (10): checkasm: Add vc1dsp in-loop deblocking filter tests checkasm: Add vc1dsp inverse transform tests checkasm: Add idctdsp add/put-pixels-clamped tests avcodec/vc1: Intr

Re: [FFmpeg-devel] [PATCH 0/6] avcodec/vc1: Arm optimisations

2022-03-21 Thread Ben Avison
Hi Martin, Thanks very much for taking a look. On 19/03/2022 23:06, Martin Storsjö wrote: As you are writing assembly for these functions, I would very much appreciate if you could add checkasm tests for all the functions you're implementing. I see that there exists a test for the blockdsp fun

Re: [FFmpeg-devel] [PATCH 6/6] avcodec/vc1: Introduce fast path for unescaping bitstream buffer

2022-03-21 Thread Ben Avison
On 18/03/2022 19:10, Andreas Rheinhardt wrote: Ben Avison: +static int vc1_unescape_buffer_neon(const uint8_t *src, int size, uint8_t *dst) +{ +/* Dealing with starting and stopping, and removing escape bytes, are + * comparatively less time-sensitive, so are more clearly expressed
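The snippet above is cut off before the unescaping rule itself appears. As a rough scalar illustration of what such a routine does, the sketch below assumes the usual VC-1 convention that a 0x03 byte preceded by two zero bytes and followed by a byte less than 4 is an escape byte to be dropped. The name `unescape_sketch` is hypothetical, and this is neither the C fallback nor the NEON wrapper from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar sketch; assumes the escape pattern 00 00 03 0x,
 * where x < 4. Returns the number of bytes written to dst, which must
 * have room for at least `size` bytes. */
static int unescape_sketch(const uint8_t *src, int size, uint8_t *dst)
{
    int out = 0;
    assert(size >= 0);
    for (int i = 0; i < size; i++) {
        if (i >= 2 && i < size - 1 &&
            src[i] == 0x03 && src[i - 1] == 0 && src[i - 2] == 0 &&
            src[i + 1] < 4) {
            /* Drop the escape byte and copy the byte that follows it. */
            dst[out++] = src[++i];
        } else {
            dst[out++] = src[i];
        }
    }
    return out;
}
```

Because escape bytes are rare in practice, a fast path can copy long runs unchanged and fall back to a byte-wise loop like this one only around candidate positions, which matches the division of labour the quoted comment describes.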

[FFmpeg-devel] [PATCH 5/6] avcodec/blockdsp: Arm 64-bit NEON block clear fast paths

2022-03-17 Thread Ben Avison
Signed-off-by: Ben Avison --- libavcodec/aarch64/Makefile | 2 + libavcodec/aarch64/blockdsp_init_aarch64.c | 42 + libavcodec/aarch64/blockdsp_neon.S | 43 ++ libavcodec/blockdsp.c | 2 + libavcodec/blockdsp.h

[FFmpeg-devel] [PATCH 6/6] avcodec/vc1: Introduce fast path for unescaping bitstream buffer

2022-03-17 Thread Ben Avison
Populate with implementations suitable for 32-bit and 64-bit Arm. Signed-off-by: Ben Avison --- libavcodec/aarch64/vc1dsp_init_aarch64.c | 60 libavcodec/aarch64/vc1dsp_neon.S | 176 +++ libavcodec/arm/vc1dsp_init_neon.c | 60 libavcodec

[FFmpeg-devel] [PATCH 4/6] avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths

2022-03-17 Thread Ben Avison
Signed-off-by: Ben Avison --- libavcodec/aarch64/Makefile | 3 +- libavcodec/aarch64/idctdsp_init_aarch64.c | 26 +++-- libavcodec/aarch64/idctdsp_neon.S | 130 ++ 3 files changed, 150 insertions(+), 9 deletions(-) create mode 100644 libavcodec

[FFmpeg-devel] [PATCH 3/6] avcodec/vc1: Arm 64-bit NEON inverse transform fast paths

2022-03-17 Thread Ben Avison
Signed-off-by: Ben Avison --- libavcodec/aarch64/vc1dsp_init_aarch64.c | 19 + libavcodec/aarch64/vc1dsp_neon.S | 678 +++ 2 files changed, 697 insertions(+) diff --git a/libavcodec/aarch64/vc1dsp_init_aarch64.c b/libavcodec/aarch64/vc1dsp_init_aarch64.c index

[FFmpeg-devel] [PATCH 2/6] avcodec/vc1: Arm 32-bit NEON deblocking filter fast paths

2022-03-17 Thread Ben Avison
Signed-off-by: Ben Avison --- libavcodec/arm/vc1dsp_init_neon.c | 14 + libavcodec/arm/vc1dsp_neon.S | 643 ++ 2 files changed, 657 insertions(+) diff --git a/libavcodec/arm/vc1dsp_init_neon.c b/libavcodec/arm/vc1dsp_init_neon.c index 2cca784f5a..f5f5c702d7

[FFmpeg-devel] [PATCH 1/6] avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths

2022-03-17 Thread Ben Avison
Signed-off-by: Ben Avison --- libavcodec/aarch64/Makefile | 1 + libavcodec/aarch64/vc1dsp_init_aarch64.c | 14 + libavcodec/aarch64/vc1dsp_neon.S | 698 +++ 3 files changed, 713 insertions(+) create mode 100644 libavcodec/aarch64/vc1dsp_neon.S diff

[FFmpeg-devel] [PATCH 0/6] avcodec/vc1: Arm optimisations

2022-03-17 Thread Ben Avison
1.22x 0.82x 1.00x 0.67x After speed: 1.31x 0.98x 1.39x 1.06x Improvement: 7.4% 20% 39% 58% `make fate` passes on both AArch32 and AArch64. Ben Avison (6): avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths avcodec/vc1: Arm 32-bit

[FFmpeg-devel] [Updated PATCH 2/4] armv6: Accelerate ff_fft_calc for general case (nbits != 4)

2014-07-11 Thread Ben Avison
The previous implementation targeted DTS Coherent Acoustics, which only requires nbits == 4 (fft16()). This case was (and still is) linked directly rather than being indirected through ff_fft_calc_vfp(), but now the full range from radix-4 up to radix-65536 is available. This benefits other codecs

[FFmpeg-devel] [PATCH 4/4] armv6: Accelerate butterflies_float

2014-07-10 Thread Ben Avison
I benchmarked the result by measuring the number of gperftools samples that hit anywhere in the AAC decoder (starting from aac_decode_frame()) or specifically in butterflies_float_c() / ff_butterflies_float_vfp() for the same sample AAC stream: Before After

[FFmpeg-devel] [PATCH 3/4] armv6: Accelerate vector_fmul_window

2014-07-10 Thread Ben Avison
I benchmarked the result by measuring the number of gperftools samples that hit anywhere in the AAC decoder (starting from aac_decode_frame()) or specifically in vector_fmul_window_c() / ff_vector_fmul_window_vfp() for the same sample AAC stream: Before After

[FFmpeg-devel] [PATCH 2/4] armv6: Accelerate ff_fft_calc for general case (nbits != 4)

2014-07-10 Thread Ben Avison
The previous implementation targeted DTS Coherent Acoustics, which only requires nbits == 4 (fft16()). This case was (and still is) linked directly rather than being indirected through ff_fft_calc_vfp(), but now the full range from radix-4 up to radix-65536 is available. This benefits other codecs

[FFmpeg-devel] [PATCH 1/4] armv6: Accelerate ff_imdct_half for general case (mdct_bits != 6)

2014-07-10 Thread Ben Avison
The previous implementation targeted DTS Coherent Acoustics, which only requires mdct_bits == 6. This relatively small size lent itself to unrolling the loops a small number of times, and encoding offsets calculated at assembly time within the load/store instructions of each iteration. In the more