Re: [FFmpeg-devel] [PATCH] lavc/aarch64: Add neon implementation for sse16

Martin Storsjö Wed, 03 Aug 2022 06:22:33 -0700

On Mon, 25 Jul 2022, Hubert Mazur wrote:

Provide neon implementation for sse16 function.


Performance comparison tests are shown below.
- sse_0_c: 273.0
- sse_0_neon: 48.2

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <h...@semihalf.com>
---
libavcodec/aarch64/me_cmp_init_aarch64.c |  4 ++
libavcodec/aarch64/me_cmp_neon.S         | 82 ++++++++++++++++++++++++
2 files changed, 86 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 136b008eb7..3ff5767bd0 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -30,6 +30,9 @@ int ff_pix_abs16_xy2_neon(MpegEncContext *s, uint8_t *blk1, uint8_t *blk2,
int ff_pix_abs16_x2_neon(MpegEncContext *v, uint8_t *pix1, uint8_t *pix2,
                      ptrdiff_t stride, int h);

+int sse16_neon(MpegEncContext *v, uint8_t *pix1, uint8_t *pix2,
+                  ptrdiff_t stride, int h);

The signature of these functions has changed (right after these patches were submitted); the pix1/pix2 parameters are now const.

Also, nitpick: please align the following line ("ptrdiff_t stride, ...") correctly with the parenthesis on the line above.
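
A minimal sketch of what the updated declaration could look like, with const pixel pointers and the continuation line aligned under the opening parenthesis. The name sse16_ref and its scalar body are hypothetical, included only so the prototype can be compiled and checked; MpegEncContext is forward-declared here as an opaque stand-in for the real FFmpeg type.

```c
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-in for the real FFmpeg type (assumption for this sketch). */
typedef struct MpegEncContext MpegEncContext;

/* Prototype with const pix1/pix2; the second line is aligned with the
 * opening parenthesis of the first. */
int sse16_ref(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
              ptrdiff_t stride, int h);

/* Hypothetical scalar reference: sum of squared differences over a
 * 16-pixel-wide block of h rows. */
int sse16_ref(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
              ptrdiff_t stride, int h)
{
    int sum = 0;

    (void)v; /* unused, as in the assembly version (x0 - unused) */

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++) {
            int d = pix1[x] - pix2[x];
            sum += d * d;
        }
        pix1 += stride;
        pix2 += stride;
    }
    return sum;
}
```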

+
av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
{
    int cpu_flags = av_get_cpu_flags();
@@ -40,5 +43,6 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
        c->pix_abs[0][3] = ff_pix_abs16_xy2_neon;

        c->sad[0] = ff_pix_abs16_neon;
+        c->sse[0] = sse16_neon;
    }
}
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index cda7ce0408..98c912b608 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -270,3 +270,85 @@ function ff_pix_abs16_x2_neon, export=1

        ret
endfunc
+
+function sse16_neon, export=1
+        // x0 - unused
+        // x1 - pix1
+        // x2 - pix2
+        // x3 - stride
+        // w4 - h
+
+        cmp             w4, #4
+        movi            d18, #0
+        b.lt            2f
+
+// Make 4 iterations at once
+1:
+
+        // res = abs(pix1[0] - pix2[0])
+        // res * res
+
+        ld1             {v0.16b}, [x1], x3              // Load pix1 vector for first iteration
+        ld1             {v1.16b}, [x2], x3              // Load pix2 vector for first iteration
+        uabd            v30.16b, v0.16b, v1.16b         // Absolute difference, first iteration

Try to improve the interleaving of this function; I did a quick test on Cortex A53, A72 and A73, and got these numbers:


             A53     A72     A73
Before:
sse_0_neon:  147.7   64.5   64.7
After:
sse_0_neon:  133.7   60.7   59.2

Overall, try to avoid having consecutive instructions operating on the same iteration (except when doing the same operation on different halves of the same iteration), so that each instruction's result latency is hidden behind independent work. I.e. not "absolute difference third iteration; multiply lower half third iteration; multiply upper half third iteration; pairwise add third iteration", but bundle it up so you have e.g. "absolute difference third iteration; pairwise add first iteration; multiply {upper,lower} half third iteration; pairwise add second iteration; pairwise add third iteration", or something like that.

Then secondly, in general, don't serialize the summation down to a single element in each iteration! You can keep the accumulated sum as a vX.4s vector (or maybe even better, two .4s vectors!) throughout the whole algorithm, and then only add them up horizontally (with an uaddv) at the end.
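
To illustrate the accumulation pattern described above, here is a minimal sketch in portable C rather than NEON assembly. It models a single four-lane accumulator (standing in for a .4s register) that is kept across all rows and reduced only once at the end. The function name sse16_lane_acc and the column-to-lane mapping are assumptions made for this sketch, not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: four 32-bit lanes play the role of one .4s
 * accumulator register. */
static uint32_t sse16_lane_acc(const uint8_t *pix1, const uint8_t *pix2,
                               ptrdiff_t stride, int h)
{
    uint32_t acc[4] = { 0, 0, 0, 0 };

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++) {
            int d = pix1[x] - pix2[x];
            /* Accumulate per lane; no horizontal reduction inside the
             * loop. The x & 3 lane choice is arbitrary for this model. */
            acc[x & 3] += (uint32_t)(d * d);
        }
        pix1 += stride;
        pix2 += stride;
    }

    /* Single horizontal add at the very end. */
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Because addition is associative and commutative, the final total does not depend on which lane each product lands in; only the single reduction at the end matters.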

For adding vectors, I would instinctively prefer doing "uaddl v0.4s, v2.4h, v3.4h; uaddl2 v1.4s, v2.8h, v3.8h" instead of "uaddlp v0.4s, v1.4h; uadalp v0.4s, v1.8h" etc.
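
As a rough illustration of the two ways of adding, the sketch below models the four instructions on plain C arrays. All helper names are hypothetical, and the pairwise forms are modeled with a full .8h source, which is an assumption about the intended arrangement. The lane layouts differ, but the horizontal totals agree.

```c
#include <stdint.h>

/* uaddl Vd.4s, Vn.4h, Vm.4h: widening add of the lower four lanes. */
static void model_uaddl(uint32_t d[4], const uint16_t n[8], const uint16_t m[8])
{
    for (int i = 0; i < 4; i++)
        d[i] = (uint32_t)n[i] + m[i];
}

/* uaddl2 Vd.4s, Vn.8h, Vm.8h: widening add of the upper four lanes. */
static void model_uaddl2(uint32_t d[4], const uint16_t n[8], const uint16_t m[8])
{
    for (int i = 0; i < 4; i++)
        d[i] = (uint32_t)n[4 + i] + m[4 + i];
}

/* uaddlp Vd.4s, Vn.8h: widening add of adjacent pairs. */
static void model_uaddlp(uint32_t d[4], const uint16_t n[8])
{
    for (int i = 0; i < 4; i++)
        d[i] = (uint32_t)n[2 * i] + n[2 * i + 1];
}

/* uadalp Vd.4s, Vn.8h: as uaddlp, but accumulating into Vd. */
static void model_uadalp(uint32_t d[4], const uint16_t n[8])
{
    for (int i = 0; i < 4; i++)
        d[i] += (uint32_t)n[2 * i] + n[2 * i + 1];
}

/* Horizontal add across the four 32-bit lanes. */
static uint32_t horizontal_sum(const uint32_t v[4])
{
    return v[0] + v[1] + v[2] + v[3];
}
```

In this model the uaddl/uaddl2 pair writes two independent destination vectors, whereas uadalp depends on the result of the preceding uaddlp.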

I didn't try out this modification, but please do; I'm pretty sure it will be a fair bit faster, and if not, at least more idiomatic SIMD.

I didn't check the other patches yet, but if the other sse* functions are implemented similarly, I would expect the same feedback to apply to them too.

Let's iterate on the sse16 patch first and get that one great, and then update sse4/sse8 similarly once we have that one settled.

I'll try to have a look at the other patches in the set later today/tomorrow.


// Martin
