checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.
vc1dsp.vc1_unescape_buffer_c: 918624.7
vc1dsp.vc1_unescape_buffer_neon: 142958.0
Signed-off-by: Ben Avison
---
libavcodec/arm/vc1dsp_init_neon.c | 61 +++
libavcodec/arm/vc1dsp_neon.S | 118
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.
vc1dsp.vc1_unescape_buffer_c: 655617.7
vc1dsp.vc1_unescape_buffer_neon: 118237.0
Signed-off-by: Ben Avison
---
libavcodec/aarch64/vc1dsp_init_aarch64.c | 61
libavcodec/aarch64/vc1dsp_neon.S | 176
Signed-off-by: Ben Avison
---
libavcodec/aarch64/Makefile | 3 +-
libavcodec/aarch64/idctdsp_init_aarch64.c | 26 +++--
libavcodec/aarch64/idctdsp_neon.S | 130 ++
3 files changed, 150 insertions(+), 9 deletions(-)
create mode 100644 libavcodec/aarch64
: 268.2
vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5
Signed-off-by: Ben Avison
---
libavcodec/aarch64/vc1dsp_init_aarch64.c | 19 +
libavcodec/aarch64/vc1dsp_neon.S | 678 +++
2 files changed, 697 insertions(+)
diff --git a/libavcodec/aarch64/vc1dsp_init_aarch64.c
b
Includes a checkasm test.
Signed-off-by: Ben Avison
---
libavcodec/vc1dec.c | 20 ++--
libavcodec/vc1dsp.c | 2 ++
libavcodec/vc1dsp.h | 3 ++
tests/checkasm/vc1dsp.c | 67 +
4 files changed, 82 insertions(+), 10 deletions(-)
diff
vc1dsp.vc1_v_loop_filter16_bestcase_neon: 103.7
vc1dsp.vc1_v_loop_filter16_worstcase_c: 646.5
vc1dsp.vc1_v_loop_filter16_worstcase_neon: 110.7
Signed-off-by: Ben Avison
---
libavcodec/arm/vc1dsp_init_neon.c | 14 +
libavcodec/arm/vc1dsp_neon.S | 643 ++
2 files
Signed-off-by: Ben Avison
---
tests/checkasm/Makefile | 1 +
tests/checkasm/checkasm.c | 3 ++
tests/checkasm/checkasm.h | 1 +
tests/checkasm/idctdsp.c | 98 +++
tests/fate/checkasm.mak | 1 +
5 files changed, 104 insertions(+)
create mode 100644
vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0
vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2
vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2
Signed-off-by: Ben Avison
---
libavcodec/aarch64/Makefile | 1 +
libavcodec/aarch64/vc1dsp_init_aarch64.c | 14 +
libavcodec/aarch64/vc1dsp_neon.S
at both the existing AArch32 decoder
and my new AArch64 decoder pass.
Signed-off-by: Ben Avison
---
tests/checkasm/vc1dsp.c | 283
1 file changed, 283 insertions(+)
diff --git a/tests/checkasm/vc1dsp.c b/tests/checkasm/vc1dsp.c
index 2fd6c74d6c..7d44
these two extremes.
Signed-off-by: Ben Avison
---
tests/checkasm/Makefile | 1 +
tests/checkasm/checkasm.c | 3 ++
tests/checkasm/checkasm.h | 1 +
tests/checkasm/vc1dsp.c | 102 ++
tests/fate/checkasm.mak | 1 +
5 files changed, 108 insertions
with tighter alignment than is
encountered in normal use.
* Correct unescape buffer memcmp length.
* Update benchmarks for AArch64 idctdsp.
Ben Avison (10):
checkasm: Add vc1dsp in-loop deblocking filter tests
checkasm: Add vc1dsp inverse transform tests
checkasm: Add idctdsp add/put-p
On 30/03/2022 15:14, Martin Storsjö wrote:
On Fri, 25 Mar 2022, Ben Avison wrote:
+// Clamp 16-bit signed block coefficients to signed 8-bit (biased by 128)
+// On entry:
+// x0 -> array of 64x 16-bit coefficients
+// x1 -> 8-bit results
+// x2 = row stride for results, bytes
+fu
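For readers following the quoted comment, here is a minimal scalar sketch of the operation it describes, assuming an 8x8 block stored row by row and that "biased by 128" means 128 is added before saturating to 0..255. The names `clip_u8` and `put_signed_pixels_clamped_sketch` are hypothetical and are not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: saturate an int to the unsigned 8-bit range. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Illustrative scalar equivalent (assumed semantics, not the NEON code):
 * block  -> 64 signed 16-bit coefficients, row-major 8x8
 * pixels -> 8-bit results
 * stride =  row stride for results, in bytes */
static void put_signed_pixels_clamped_sketch(const int16_t *block,
                                             uint8_t *pixels,
                                             ptrdiff_t stride)
{
    for (int row = 0; row < 8; row++) {
        for (int col = 0; col < 8; col++)
            pixels[col] = clip_u8(block[row * 8 + col] + 128);
        pixels += stride;
    }
}
```

Clamping to -128..127 and then adding 128 gives the same result as adding 128 first and clamping to 0..255, which is the form used in the sketch.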
On 30/03/2022 14:49, Martin Storsjö wrote:
Looks generally reasonable. Is it possible to factorize out the
individual transforms (so that you'd e.g. invoke the same macro twice in
the 8x8 and 4x4 functions) without too much loss?
There is a close analogy here with the vertical/horizontal deblo
On 30/03/2022 13:35, Martin Storsjö wrote:
Overall, the code looks sensible to me.
Would it make sense to share the core of the filter between the
horizontal/vertical cases with e.g. a macro? (I didn't check in detail
if there are many differences in the core of the filter. At most some
differe
On 29/03/2022 21:37, Martin Storsjö wrote:
On Fri, 25 Mar 2022, Ben Avison wrote:
+#define TEST_UNESCAPE \
+ do
On 29/03/2022 14:13, Martin Storsjö wrote:
On Fri, 25 Mar 2022, Ben Avison wrote:
Disable ff_add_pixels_clamped_arm, which was found to fail the test.
I had a look at this function, and I see that the overflow checks are using
tst r6, #0x100
to see whether the
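The quoted message is cut off here, so the following is only an arithmetic illustration of what a single-bit test such as `tst r6, #0x100` can and cannot detect. It assumes the value being tested is the sum of an 8-bit pixel and a 16-bit coefficient; the helper names are hypothetical and this is not a transcription of ff_add_pixels_clamped_arm.

```c
#include <stdint.h>

/* What a test of bit 8 alone reports for a given sum. */
static int bit8_set(int sum)
{
    return ((uint32_t)sum & 0x100u) != 0;
}

/* What a clamp to 0..255 actually needs to know. */
static int out_of_range(int sum)
{
    return sum < 0 || sum > 255;
}
```

The two predicates agree for sums from -256 to 511. Outside that window bit 8 can be clear even though the value needs clamping, for example 512 (0x200) or -257.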
On 29/03/2022 13:44, Martin Storsjö wrote:
The existing x86 assembly for loop filters uses the stride as a
full register without clearing/sign extending the upper half
of the registers on x86_64.
This avoids crashes if the caller would have passed nonzero bits
in the previously undefined upper 3
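As a rough illustration of why the upper half of the register matters, the sketch below models a 64-bit register that holds a 32-bit stride. It assumes a 64-bit target where the callee reads the whole register; the function names are hypothetical.

```c
#include <stdint.h>

/* Register contents when the stride is properly sign-extended,
 * which is what a pointer-width signed parameter gives the callee. */
static uint64_t reg_sign_extended(int stride)
{
    return (uint64_t)(int64_t)stride;
}

/* Register contents when only the low 32 bits are defined and the
 * upper 32 bits hold whatever the caller left there. */
static uint64_t reg_with_upper_bits(int stride, uint32_t upper)
{
    return ((uint64_t)upper << 32) | (uint32_t)stride;
}
```

A negative stride with a zeroed upper half, or any stride with leftover upper bits, differs from the sign-extended value, so using the full register as an address offset would point far outside the buffer.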
On 25/03/2022 22:53, Martin Storsjö wrote:
On Fri, 25 Mar 2022, Ben Avison wrote:
+#define CHECK_LOOP_FILTER(func) \
+ do { \
+ if (check_func(h.func, "vc1dsp."
Includes a checkasm test.
Signed-off-by: Ben Avison
---
libavcodec/vc1dec.c | 20 +++---
libavcodec/vc1dsp.c | 2 ++
libavcodec/vc1dsp.h | 3 +++
tests/checkasm/vc1dsp.c | 59 +
4 files changed, 74 insertions(+), 10 deletions
Disable ff_add_pixels_clamped_arm, which was found to fail the test. As this
is normally only used for Arms prior to Armv6 (ARM11) it seems quite unlikely
that anyone is still using this, so I haven't put in the effort to debug it.
Signed-off-by: Ben Avison
---
libavcodec/arm/idctdsp_init_
at both the existing AArch32 decoder
and my new AArch64 decoder pass.
Signed-off-by: Ben Avison
---
tests/checkasm/vc1dsp.c | 258
1 file changed, 258 insertions(+)
diff --git a/tests/checkasm/vc1dsp.c b/tests/checkasm/vc1dsp.c
index db916d08f9..0823
these two extremes.
Signed-off-by: Ben Avison
---
tests/checkasm/Makefile | 1 +
tests/checkasm/checkasm.c | 3 ++
tests/checkasm/checkasm.h | 1 +
tests/checkasm/vc1dsp.c | 94 +++
tests/fate/checkasm.mak | 1 +
5 files changed, 100 insertions
rch64 blockdsp fast paths since it was impossible to demonstrate
that they had any appreciable effect on timings.
Ben Avison (10):
checkasm: Add vc1dsp in-loop deblocking filter tests
checkasm: Add vc1dsp inverse transform tests
checkasm: Add idctdsp add/put-pixels-clamped tests
avcodec/vc1: Intr
Hi Martin,
Thanks very much for taking a look.
On 19/03/2022 23:06, Martin Storsjö wrote:
As you are writing assembly for these functions, I would very much
appreciate if you could add checkasm tests for all the functions you're
implementing. I see that there exists a test for the blockdsp fun
On 18/03/2022 19:10, Andreas Rheinhardt wrote:
Ben Avison:
+static int vc1_unescape_buffer_neon(const uint8_t *src, int size, uint8_t *dst)
+{
+/* Dealing with starting and stopping, and removing escape bytes, are
+ * comparatively less time-sensitive, so are more clearly expressed
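To make "removing escape bytes" concrete, here is a small scalar sketch. It assumes the usual VC-1 rule that a 0x03 byte which follows two zero bytes and precedes a byte below 0x04 is an inserted escape byte and is dropped. `unescape_sketch` is a hypothetical name and is not the function from the patch.

```c
#include <stdint.h>

/* Copy src to dst, dropping escape bytes (assumed rule: 00 00 03 xx
 * with xx < 4 loses the 03). Returns the number of bytes written.
 * dst must have room for at least `size` bytes. */
static int unescape_sketch(const uint8_t *src, int size, uint8_t *dst)
{
    int out = 0;

    for (int i = 0; i < size; i++) {
        if (i >= 2 && i + 1 < size &&
            src[i] == 3 && src[i - 1] == 0 && src[i - 2] == 0 &&
            src[i + 1] < 4)
            continue;           /* skip the escape byte itself */
        dst[out++] = src[i];    /* every other byte is copied unchanged */
    }
    return out;
}
```

A vectorised version would typically scan long runs for candidate positions and fall back to a loop like this one only around the bytes that match.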
Signed-off-by: Ben Avison
---
libavcodec/aarch64/Makefile | 2 +
libavcodec/aarch64/blockdsp_init_aarch64.c | 42 +
libavcodec/aarch64/blockdsp_neon.S | 43 ++
libavcodec/blockdsp.c | 2 +
libavcodec/blockdsp.h
Populate with implementations suitable for 32-bit and 64-bit Arm.
Signed-off-by: Ben Avison
---
libavcodec/aarch64/vc1dsp_init_aarch64.c | 60
libavcodec/aarch64/vc1dsp_neon.S | 176 +++
libavcodec/arm/vc1dsp_init_neon.c | 60
libavcodec
Signed-off-by: Ben Avison
---
libavcodec/aarch64/Makefile | 3 +-
libavcodec/aarch64/idctdsp_init_aarch64.c | 26 +++--
libavcodec/aarch64/idctdsp_neon.S | 130 ++
3 files changed, 150 insertions(+), 9 deletions(-)
create mode 100644 libavcodec
Signed-off-by: Ben Avison
---
libavcodec/aarch64/vc1dsp_init_aarch64.c | 19 +
libavcodec/aarch64/vc1dsp_neon.S | 678 +++
2 files changed, 697 insertions(+)
diff --git a/libavcodec/aarch64/vc1dsp_init_aarch64.c
b/libavcodec/aarch64/vc1dsp_init_aarch64.c
index
Signed-off-by: Ben Avison
---
libavcodec/arm/vc1dsp_init_neon.c | 14 +
libavcodec/arm/vc1dsp_neon.S | 643 ++
2 files changed, 657 insertions(+)
diff --git a/libavcodec/arm/vc1dsp_init_neon.c
b/libavcodec/arm/vc1dsp_init_neon.c
index 2cca784f5a..f5f5c702d7
Signed-off-by: Ben Avison
---
libavcodec/aarch64/Makefile | 1 +
libavcodec/aarch64/vc1dsp_init_aarch64.c | 14 +
libavcodec/aarch64/vc1dsp_neon.S | 698 +++
3 files changed, 713 insertions(+)
create mode 100644 libavcodec/aarch64/vc1dsp_neon.S
diff
1.22x 0.82x 1.00x 0.67x
After speed: 1.31x 0.98x 1.39x 1.06x
Improvement: 7.4% 20% 39% 58%
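The improvement figures can be reproduced from the two rows of speed multiples, assuming the unlabelled first row holds the "before" values (its label is cut off in this excerpt). `improvement_percent` is a hypothetical helper written only for this check.

```c
/* Percentage gain of `after` over `before`, both given as speed
 * multiples relative to the same baseline. */
static double improvement_percent(double before, double after)
{
    return (after / before - 1.0) * 100.0;
}
```

For example, `improvement_percent(1.22, 1.31)` is about 7.4 and `improvement_percent(0.67, 1.06)` is about 58, matching the first and last columns.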
`make fate` passes on both AArch32 and AArch64.
Ben Avison (6):
avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths
avcodec/vc1: Arm 32-bit
The previous implementation targeted DTS Coherent Acoustics, which only
requires nbits == 4 (fft16()). This case was (and still is) linked directly
rather than being indirected through ff_fft_calc_vfp(), but now the full
range from radix-4 up to radix-65536 is available. This benefits other codecs
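The difference between linking one size directly and going through a dispatcher can be sketched as follows. This is an assumption about the general shape of such a dispatcher, not the code behind ff_fft_calc_vfp(): the table, the stubs, and the `fft_fn` signature are all hypothetical, and only three sizes are filled in.

```c
#include <stddef.h>

typedef void (*fft_fn)(float *z);    /* hypothetical signature */

static int last_size;                /* records which stub ran, for illustration */
static void fft4_stub(float *z)  { (void)z; last_size = 4; }
static void fft8_stub(float *z)  { (void)z; last_size = 8; }
static void fft16_stub(float *z) { (void)z; last_size = 16; }

/* Indexed by nbits - 2; a full table would continue up to nbits == 16. */
static const fft_fn fft_table[] = { fft4_stub, fft8_stub, fft16_stub };

/* Dispatch on nbits; returns the transform size, or -1 if unsupported. */
static int fft_calc_sketch(int nbits, float *z)
{
    size_t idx = (size_t)(nbits - 2);

    if (nbits < 2 || idx >= sizeof(fft_table) / sizeof(fft_table[0]))
        return -1;
    fft_table[idx](z);
    return 1 << nbits;
}
```

A caller that only ever needs nbits == 4 can call the 16-point routine directly and skip the table lookup, which is the arrangement the text describes for the DTS case.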
I benchmarked the result by measuring the number of gperftools samples that
hit anywhere in the AAC decoder (starting from aac_decode_frame()) or
specifically in butterflies_float_c() / ff_butterflies_float_vfp() for the
same sample AAC stream:
Before After
I benchmarked the result by measuring the number of gperftools samples that
hit anywhere in the AAC decoder (starting from aac_decode_frame()) or
specifically in vector_fmul_window_c() / ff_vector_fmul_window_vfp() for the
same sample AAC stream:
Before After
The previous implementation targeted DTS Coherent Acoustics, which only
requires mdct_bits == 6. This relatively small size lent itself to
unrolling the loops a small number of times, and encoding offsets
calculated at assembly time within the load/store instructions of each
iteration.
In the more