aarch64: add hscale specializations

Martin Storsjö Wed, 25 May 2022 01:55:13 -0700

On Wed, 25 May 2022, Martin Storsjö wrote:

On Wed, 25 May 2022, Swinney, Jonathan wrote:
This patch adds code to support specializations of the hscale function and
adds a specialization for filterSize == 4.

ff_hscale8to15_4_neon is a complete rewrite. Since the main bottleneck here
is loading the data from src, this data is loaded a whole block ahead and
stored back to the stack to be loaded again with ld4. This arranges the data
for most efficient use of the vector instructions and removes the need for
completion adds at the end. The number of iterations of the C per iteration
of the assembly is increased from 4 to 8, but because of the prefetching,
there must be a special section without prefetching when dstW < 16.

This improves speed on Graviton 2 (Neoverse N1) dramatically in the case
where previously fs=8 would have been required.

before: hscale_8_to_15__fs_8_dstW_512_neon: 1962.8
after : hscale_8_to_15__fs_4_dstW_512_neon: 1220.9
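
As a rough illustration of the arithmetic being specialized, here is a
minimal C sketch. It is not the FFmpeg source: the names hscale8to15_ref and
hscale8to15_4_sketch are made up for this example, and the ">> 7" shift with
clamping to 15 bits is an assumption based on the 8-bit input, 14-bit filter
and 15-bit output described in the patch. The second function only mirrors
the "8 outputs per iteration, then a tail for dstW % 8 != 0" structure; it
does not model the prefetching or the store-to-stack/ld4 round trip.

```c
#include <stdint.h>

/* Hypothetical scalar reference: each output sums filterSize products,
 * starting at src[filterPos[i]]. Assumes >> on a negative int is an
 * arithmetic shift, as common compilers provide. */
static void hscale8to15_ref(int16_t *dst, int dstW, const uint8_t *src,
                            const int16_t *filter, const int32_t *filterPos,
                            int filterSize)
{
    for (int i = 0; i < dstW; i++) {
        int val = 0;
        for (int j = 0; j < filterSize; j++)
            val += (int)src[filterPos[i] + j] * filter[filterSize * i + j];
        val >>= 7;                      /* 8-bit * 14-bit -> 15-bit (assumed) */
        dst[i] = val > 32767 ? 32767 : (int16_t)val;
    }
}

/* Same result for filterSize == 4, written as blocks of 8 outputs
 * followed by a one-at-a-time tail for the dstW % 8 leftovers. */
static void hscale8to15_4_sketch(int16_t *dst, int dstW, const uint8_t *src,
                                 const int16_t *filter,
                                 const int32_t *filterPos)
{
    int i = 0;
    for (; i + 8 <= dstW; i += 8) {
        for (int k = 0; k < 8; k++) {
            const uint8_t *s = src + filterPos[i + k];
            const int16_t *f = filter + 4 * (i + k);
            int val = (s[0] * f[0] + s[1] * f[1] +
                       s[2] * f[2] + s[3] * f[3]) >> 7;
            dst[i + k] = val > 32767 ? 32767 : (int16_t)val;
        }
    }
    if (i < dstW)                       /* remaining dstW % 8 outputs */
        hscale8to15_ref(dst + i, dstW - i, src, filter + 4 * i,
                        filterPos + i, 4);
}
```

Comparing the two functions on the same input is a cheap way to check a
specialization against the generic path before looking at timings.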

Signed-off-by: Jonathan Swinney <jswin...@amazon.com>
---
libswscale/aarch64/hscale.S  | 172 ++++++++++++++++++++++++++++++++++-
libswscale/aarch64/swscale.c |  40 ++++++--
libswscale/utils.c           |   2 +-
3 files changed, 203 insertions(+), 11 deletions(-)

diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index da34f1cb8d..60bcd783e7 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -1,5 +1,7 @@
/*
 * Copyright (c) 2016 Clément Bœsch <clement stupeflix.com>
+ * Copyright (c) 2019-2021 Sebastian Pop <s...@amazon.com>
+ * Copyright (c) 2022 Jonathan Swinney <jswin...@amazon.com>
 *
 * This file is part of FFmpeg.
 *
@@ -20,7 +22,25 @@

#include "libavutil/aarch64/asm.S"

-function ff_hscale_8_to_15_neon, export=1
+/*
+;-----------------------------------------------------------------------------
+; horizontal line scaling
+;
+; void hscale<source_width>to<intermediate_nbits>_<filterSize>_<opt>
+;                               (SwsContext *c, int{16,32}_t *dst,
+;                                int dstW, const uint{8,16}_t *src,
+;                                const int16_t *filter,
+;                                const int32_t *filterPos, int filterSize);
+;
+; Scale one horizontal line. Input is either 8-bit width or 16-bit width
+; ($source_width can be either 8, 9, 10 or 16, difference is whether we have to
+; downscale before multiplying). Filter is 14 bits. Output is either 15 bits
+; (in int16_t) or 19 bits (in int32_t), as given in $intermediate_nbits. Each
+; output pixel is generated from $filterSize input pixels, the position of
+; the first pixel is given in filterPos[nOutputPixel].
+;-----------------------------------------------------------------------------*/
+
+function ff_hscale8to15_X8_neon, export=1
        sbfiz               x7, x6, #1, #32             // filterSize*2 (*2 because int16)
1:      ldr                 w8, [x5], #4                // filterPos[idx]
        ldr                 w0, [x5], #4                // filterPos[idx + 1]
@@ -70,3 +90,153 @@ function ff_hscale_8_to_15_neon, export=1
        b.gt                1b                          // loop until end of line
        ret
endfunc
+
+function ff_hscale8to15_4_neon, export=1
+// x0  SwsContext *c (not used)
+// x1  int16_t *dst
+// x2  int dstW
+// x3  const uint8_t *src
+// x4  const int16_t *filter
+// x5  const int32_t *filterPos
+// x6  int filterSize
+// x8-x15 registers for gathering src data
+
+// v0      madd accumulator 4S
+// v1-v4   filter values (16 bit) 8H
+// v5      madd accumulator 4S
+// v16-v19 src values (8 bit) 8B
+
+// This implementation has 4 sections:
+//  1. Prefetch src data
+//  2. Interleaved prefetching src data and madd
+//  3. Complete madd
+//  4. Complete remaining iterations when dstW % 8 != 0
+
+        add                 sp, sp, #-32                // allocate 32 bytes on the stack
+        cmp                 w2, #16                     // if dstW < 16, skip to the last block used for wrapping up
+        b.lt                2f
+
+        // load 8 values from filterPos to be used as offsets into src
+        ldp                 w8, w9, [x5]                // filterPos[idx + 0], [idx + 1]
+        ldp                 w10, w11, [x5, 8]           // filterPos[idx + 2], [idx + 3]
+        ldp                 w12, w13, [x5, 16]          // filterPos[idx + 4], [idx + 5]
+        ldp                 w14, w15, [x5, 24]          // filterPos[idx + 6], [idx + 7]
The immediate offset here (8/16/24) must be preceded by a '#', otherwise it
breaks the build with MSVC (armasm64.exe).

Sorry - just for clarity, this of course holds for both these ldp here, but
also all other ldp/stp cases in the function.


// Martin

Re: [FFmpeg-devel] [PATCH v2 2/2] swscale/aarch64: add hscale specializations
