On Tue, 14 Feb 2023, xufuji456 wrote:

---
libavcodec/aarch64/hevcdsp_idct_neon.S    | 51 +++++++++++++++++++++++
libavcodec/aarch64/hevcdsp_init_aarch64.c |  4 ++
2 files changed, 55 insertions(+)

This looks mostly fine, but there are a couple of issues that break it when built with toolchains other than the one you've been using. (I presume you've built it with Clang on macOS, since that seems to be the only toolchain where it assembles successfully and runs without failures.)


diff --git a/libavcodec/aarch64/hevcdsp_idct_neon.S b/libavcodec/aarch64/hevcdsp_idct_neon.S
index 124c50998a..fe8baf1348 100644
--- a/libavcodec/aarch64/hevcdsp_idct_neon.S
+++ b/libavcodec/aarch64/hevcdsp_idct_neon.S
@@ -245,6 +245,54 @@ function hevc_add_residual_32x32_16_neon, export=0
        ret
endfunc

+.macro tr_4x4 in0, in1, in2, in3, out0, out1, out2, out3, shift, tmp0, tmp1, tmp2, tmp3, tmp4
+         sshll           \tmp0, \in0, #6

Nitpick: the indentation of \tmp0 is off here

+         smull          \tmp2, \in1, v4.h[1]
+         mov            \tmp1, \tmp0

Both binutils and MS armasm64.exe refuse to assemble an instruction like "mov v0.4s, v1.4s" - for the mov instruction, they only accept the .8b or .16b element specifiers, not .4s - even though it doesn't make any functional difference for what the instruction does.

Due to how these macros are used here, it's not very easy to come up with a neat solution. One possible solution is to just repeat "sshll \tmp1, \in0, #6"; that does the same shift operation twice, but reduces the chain depth for setting tmp1, so it's possible that it doesn't make any big difference in the end. (On Cortex A53/A72/A73, I didn't measure any significant difference here.)
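
I.e., something like this (just a sketch of the idea, using the same operand names as the macro):

          sshll           \tmp0, \in0, #6
          smull           \tmp2, \in1, v4.h[1]
          sshll           \tmp1, \in0, #6    // redone shift instead of "mov \tmp1, \tmp0"

sshll takes the .4s/.4h element specifiers everywhere, so this assembles with binutils and armasm64 too.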

+         smull          \tmp3, \in1, v4.h[3]
+         smlal          \tmp0, \in2, v4.h[0] //e0
+         smlsl          \tmp1, \in2, v4.h[0] //e1
+         smlal          \tmp2, \in3, v4.h[3] //o0
+         smlsl          \tmp3, \in3, v4.h[1] //o1
+
+         add            \tmp4, \tmp0, \tmp2
+         sub            \tmp0, \tmp0, \tmp2
+         add            \tmp2, \tmp1, \tmp3
+         sub            \tmp1, \tmp1, \tmp3
+         sqrshrn        \out0, \tmp4, #\shift
+         sqrshrn        \out3, \tmp0, #\shift
+         sqrshrn        \out1, \tmp2, #\shift
+         sqrshrn        \out2, \tmp1, #\shift
+.endm
+
+.macro transpose_4x4 r0, r1, r2, r3
+        trn1            v22.8h, \r0\().8h, \r1\().8h
+        trn2            v23.8h, \r0\().8h, \r1\().8h
+        trn1            v24.8h, \r2\().8h, \r3\().8h
+        trn2            v25.8h, \r2\().8h, \r3\().8h
+        trn1            \r0\().4s, v22.4s, v24.4s
+        trn2            \r2\().4s, v22.4s, v24.4s
+        trn1            \r1\().4s, v23.4s, v25.4s
+        trn2            \r3\().4s, v23.4s, v25.4s
+.endm

neon.S already has a bunch of matching transpose macros. I see that this file already had some similar custom transposes; I sent https://patchwork.ffmpeg.org/project/ffmpeg/patch/20230223083826.3149827-1-mar...@martin.st/ to fix that. Here, you can use transpose_4x8H from neon.S too.
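
I.e., the uses of this macro could become something like the following (a sketch; the temp registers are whatever free ones you pick, and I'm going from memory on the neon.S macro taking four explicit temporaries after the four rows):

         transpose_4x8H  v0, v1, v2, v3, v22, v23, v24, v25

That does the exact same trn1/trn2 sequence as the macro above, so there's no behavior change, just less duplication.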

+
+.macro idct_4x4 bitdepth
+function ff_hevc_idct_4x4_\bitdepth\()_neon, export=1
+        ld1             {v0.4h-v3.4h}, [x0]
+
+        movrel          x1, trans
+        ld1             {v4.4h}, [x1]
+
+        tr_4x4          v0.4h, v1.4h, v2.4h, v3.4h, v16.4h, v17.4h, v18.4h, v19.4h, 7, v10.4s, v11.4s, v12.4s, v13.4s, v15.4s

You're not allowed to clobber v8-v15 without backing them up/restoring them; per the AAPCS64 calling convention, the low 64 bits of v8-v15 are callee-saved. Checkasm on macOS doesn't detect this, but it does on other OSes.

Here it should be trivial to use other temporary registers instead, without needing to backup/restore anything.
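
E.g. something like this (a sketch; the exact register choice is up to you, these just avoid v4 and the v16-v19 outputs):

         tr_4x4          v0.4h, v1.4h, v2.4h, v3.4h, v16.4h, v17.4h, v18.4h, v19.4h, 7, v5.4s, v6.4s, v7.4s, v20.4s, v21.4s

With that, there's nothing to spill to the stack at all.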

// Martin
