On Thu, 7 Jan 2021, Josh Dekker wrote:

Signed-off-by: Josh Dekker <j...@itanimul.li>
---
libavcodec/aarch64/Makefile               |   2 +
libavcodec/aarch64/hevcdsp_add_res_neon.S | 298 ++++++++++++++++++++++
libavcodec/aarch64/hevcdsp_init.c         |  59 +++++
libavcodec/hevcdsp.c                      |   2 +
libavcodec/hevcdsp.h                      |   1 +
5 files changed, 362 insertions(+)
create mode 100644 libavcodec/aarch64/hevcdsp_add_res_neon.S
create mode 100644 libavcodec/aarch64/hevcdsp_init.c

This one is pretty much equivalent to Reimar's patch. As his one goes on top of the ported IDCT, I think I'd prefer his version of it. But in any case, some comments below:

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index f6434e40da..4bdd554e7e 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -17,6 +17,7 @@ OBJS-$(CONFIG_VP8DSP)                   += 
aarch64/vp8dsp_init_aarch64.o
OBJS-$(CONFIG_AAC_DECODER)              += aarch64/aacpsdsp_init_aarch64.o \
                                           aarch64/sbrdsp_init_aarch64.o
OBJS-$(CONFIG_DCA_DECODER)              += aarch64/synth_filter_init.o
+OBJS-$(CONFIG_HEVC_DECODER)             += aarch64/hevcdsp_init.o
OBJS-$(CONFIG_OPUS_DECODER)             += aarch64/opusdsp_init.o
OBJS-$(CONFIG_RV40_DECODER)             += aarch64/rv40dsp_init_aarch64.o
OBJS-$(CONFIG_VC1DSP)                   += aarch64/vc1dsp_init_aarch64.o
@@ -53,6 +54,7 @@ NEON-OBJS-$(CONFIG_VP8DSP)              += 
aarch64/vp8dsp_neon.o
# decoders/encoders
NEON-OBJS-$(CONFIG_AAC_DECODER)         += aarch64/aacpsdsp_neon.o
NEON-OBJS-$(CONFIG_DCA_DECODER)         += aarch64/synth_filter_neon.o
+NEON-OBJS-$(CONFIG_HEVC_DECODER)        += aarch64/hevcdsp_add_res_neon.o
NEON-OBJS-$(CONFIG_OPUS_DECODER)        += aarch64/opusdsp_neon.o
NEON-OBJS-$(CONFIG_VORBIS_DECODER)      += aarch64/vorbisdsp_neon.o
NEON-OBJS-$(CONFIG_VP9_DECODER)         += aarch64/vp9itxfm_16bpp_neon.o       \
diff --git a/libavcodec/aarch64/hevcdsp_add_res_neon.S 
b/libavcodec/aarch64/hevcdsp_add_res_neon.S
new file mode 100644
index 0000000000..4005366192
--- /dev/null
+++ b/libavcodec/aarch64/hevcdsp_add_res_neon.S
@@ -0,0 +1,298 @@
+/* -*-arm64-*-
+ *
+ * AArch64 NEON optimised add residual functions for HEVC decoding
+ *
+ * Copyright (c) 2020 Josh Dekker <j...@itanimul.li>

I believe this one is at least a bit inspired by the arm version, right? In that case it's probably customary to bring the original copyright along.

+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+.macro clip10 in1, in2, c1, c2
+    smax \in1, \in1, \c1
+    smax \in2, \in2, \c1
+    smin \in1, \in1, \c2
+    smin \in2, \in2, \c2
+.endm
+
+function ff_hevc_add_residual_4x4_8_neon, export=1
+    mov x3, x0

Please align the instructions and operands like in the existing assembly. Also, in addition to aligning those two columns, in aarch64 assembly I write myself, I also try to align the individual operands; for instructions that only take GPRs, I'd write things as "x0, x1, x2", to keep operands aligned the same for cases with registers >= x10, and for instructions that take vector registers, align everything to line up for the longest case register name, e.g. v31.16b. For SIMD loads, stores and other things, where things don't generally line up, I just try to make the code look pretty and consistent.

Also, as a matter of taste, I tend to write the lane specifications with lower case letters, i.e. .16b instead of 16B.

+    ld1 {v0.S}[0], [x3], x2
+    ld1 {v0.S}[1], [x3], x2
+    ld1 {v1.S}[0], [x3], x2
+    ld1 {v1.S}[1], [x3], x2
+    ld1 { v2.8H-v3.8H}, [x1]
+    ushll v4.8H, v0.8B, #0
+    ushll v5.8H, v1.8B, #0

ushll #0 is uxtl

+    add v6.8H, v4.8H, v2.8H
+    add v7.8H, v5.8H, v3.8H

The arm version (and Reimar's) does saturated addition here, i.e. sqadd.

There's a bunch of other minor tuning one could suggest here, but I guess it applies equally to the existing arm versions (and it's unsure whether it does provide any measurable benefit at all), so I'll refrain from suggesting it here.

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

Reply via email to