On Mon, 22 Aug 2022, Hubert Mazur wrote:

Add vectorized implementation of nsse16 function.

Performance comparison tests are shown below.
- nsse_0_c: 707.0
- nsse_0_neon: 120.0

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <h...@semihalf.com>
---
libavcodec/aarch64/me_cmp_init_aarch64.c |  15 +++
libavcodec/aarch64/me_cmp_neon.S         | 126 +++++++++++++++++++++++
2 files changed, 141 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 46d4dade5d..9fe96e111c 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -889,3 +889,129 @@ function vsse_intra16_neon, export=1

        ret
endfunc
+
+function nsse16_neon, export=1
+        // x0           multiplier
+        // x1           uint8_t *pix1
+        // x2           uint8_t *pix2
+        // x3           ptrdiff_t stride
+        // w4           int h
+
+        str             x0, [sp, #-0x40]!
+        stp             x1, x2, [sp, #0x10]
+        stp             x3, x4, [sp, #0x20]
+        str             lr, [sp, #0x30]
+        bl              sse16_neon
+        ldr             lr, [sp, #0x30]

This breaks the build in two configurations. Old binutils doesn't recognize the register name lr, so you need to spell it out as x30.

Building on macOS also breaks, since there's no symbol named sse16_neon there: it is an exported function, so its symbol gets the prefix _. You need to write "bl X(sse16_neon)" here instead.
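
For reference, a minimal sketch of the three affected lines with both changes applied. This is untested and assumes the X() macro from libavutil/aarch64/asm.S is in scope, as it normally is for files that use the function/endfunc macros:

```asm
        // x30 is the link register; older binutils rejects the "lr" alias.
        str             x30, [sp, #0x30]
        // X() prepends the platform's symbol prefix ("_" on macOS),
        // so the call resolves to the exported sse16_neon symbol.
        bl              X(sse16_neon)
        ldr             x30, [sp, #0x30]
```

The surrounding stores of x0-x4 are unchanged; only the register name and the call target differ from the patch as posted.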

Didn't look at the code from a performance perspective yet.

// Martin
