Le 5 juin 2024 09:53:45 GMT+03:00, Zhao Zhili <quinkbl...@foxmail.com> a écrit : > > >> On Jun 5, 2024, at 14:29, Rémi Denis-Courmont <r...@remlab.net> wrote: >> >> >> >> Le 4 juin 2024 16:55:01 GMT+03:00, Zhao Zhili <quinkbl...@foxmail.com >> <mailto:quinkbl...@foxmail.com>> a écrit : >>> From: Zhao Zhili <zhiliz...@tencent.com> >>> >>> Test on Apple M1: >>> >>> rgb24_to_uv_1080_c: 7.2 >>> rgb24_to_uv_1080_neon: 5.5 >>> rgb24_to_uv_1280_c: 8.2 >>> rgb24_to_uv_1280_neon: 6.2 >>> rgb24_to_uv_1920_c: 12.5 >>> rgb24_to_uv_1920_neon: 9.5 >>> >>> rgb24_to_uv_half_540_c: 6.5 >>> rgb24_to_uv_half_540_neon: 3.0 >>> rgb24_to_uv_half_640_c: 7.5 >>> rgb24_to_uv_half_640_neon: 3.2 >>> rgb24_to_uv_half_960_c: 12.5 >>> rgb24_to_uv_half_960_neon: 6.0 >>> >>> rgb24_to_y_1080_c: 4.5 >>> rgb24_to_y_1080_neon: 3.5 >>> rgb24_to_y_1280_c: 5.2 >>> rgb24_to_y_1280_neon: 4.2 >>> rgb24_to_y_1920_c: 8.0 >>> rgb24_to_y_1920_neon: 6.0 >>> >>> Signed-off-by: Zhao Zhili <zhiliz...@tencent.com> >>> --- >>> libswscale/aarch64/Makefile | 1 + >>> libswscale/aarch64/input.S | 229 +++++++++++++++++++++++++++++++++++ >>> libswscale/aarch64/swscale.c | 25 ++++ >>> 3 files changed, 255 insertions(+) >>> create mode 100644 libswscale/aarch64/input.S >>> >>> diff --git a/libswscale/aarch64/Makefile b/libswscale/aarch64/Makefile >>> index da1d909561..adfd90a1b6 100644 >>> --- a/libswscale/aarch64/Makefile >>> +++ b/libswscale/aarch64/Makefile >>> @@ -3,6 +3,7 @@ OBJS += aarch64/rgb2rgb.o \ >>> aarch64/swscale_unscaled.o \ >>> >>> NEON-OBJS += aarch64/hscale.o \ >>> + aarch64/input.o \ >>> aarch64/output.o \ >>> aarch64/rgb2rgb_neon.o \ >>> aarch64/yuv2rgb_neon.o \ >>> diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S >>> new file mode 100644 >>> index 0000000000..ee0d223c6e >>> --- /dev/null >>> +++ b/libswscale/aarch64/input.S >>> @@ -0,0 +1,229 @@ >>> +/* >>> + * Copyright (c) 2024 Zhao Zhili <quinkbl...@foxmail.com> >>> + * >>> + * This file is part of FFmpeg. >>> + * >>> + * FFmpeg is free software; you can redistribute it and/or >>> + * modify it under the terms of the GNU Lesser General Public >>> + * License as published by the Free Software Foundation; either >>> + * version 2.1 of the License, or (at your option) any later version. >>> + * >>> + * FFmpeg is distributed in the hope that it will be useful, >>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU >>> + * Lesser General Public License for more details. >>> + * >>> + * You should have received a copy of the GNU Lesser General Public >>> + * License along with FFmpeg; if not, write to the Free Software >>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA >>> 02110-1301 USA >>> + */ >>> + >>> +#include "libavutil/aarch64/asm.S" >>> + >>> +.macro rgb24_to_yuv_load_rgb, src >>> + ld3 { v16.16b, v17.16b, v18.16b }, [\src] >>> + ushll v19.8h, v16.8b, #0 // v19: r >>> + ushll v20.8h, v17.8b, #0 // v20: g >>> + ushll v21.8h, v18.8b, #0 // v21: b >>> + ushll2 v22.8h, v16.16b, #0 // v22: r >>> + ushll2 v23.8h, v17.16b, #0 // v23: g >>> + ushll2 v24.8h, v18.16b, #0 // v24: b >>> +.endm >>> + >>> +.macro rgb24_to_yuv_product, r, g, b, dst1, dst2, dst, coef0, coef1, >>> coef2, right_shift >>> + mov \dst1\().16b, v6.16b // dst1 = >>> const_offset >>> + mov \dst2\().16b, v6.16b // dst2 = >>> const_offset >>> + smlal \dst1\().4s, \coef0\().4h, \r\().4h // dst1 += >>> rx * r >>> + smlal2 \dst2\().4s, \coef0\().8h, \r\().8h // dst2 += >>> rx * r >>> + smlal \dst1\().4s, \coef1\().4h, \g\().4h // dst1 += >>> gx * g >>> + smlal2 \dst2\().4s, \coef1\().8h, \g\().8h // dst2 += >>> gx * g >>> + smlal \dst1\().4s, \coef2\().4h, \b\().4h // dst1 += >>> bx * b >>> + smlal2 \dst2\().4s, \coef2\().8h, \b\().8h // dst2 += >>> bx * b >>> + sqshrn \dst\().4h, \dst1\().4s, \right_shift // >>> dst_lower_half = dst1 >> right_shift >>> + sqshrn2 \dst\().8h, \dst2\().4s, \right_shift // >>> dst_higher_half = dst2 >> right_shift >>> +.endm >>> + >>> +function ff_rgb24ToY_neon, export=1 >>> + cmp w4, #0 // check width > 0 >>> + b.le 4f >>> + >>> + ldp w10, w11, [x5], #8 // w10: ry, w11: gy >> >> I don't think it affects anything on your OoO execution hardware, but you're >> using the result of this load right off the bat in the next instruction. >> Ditto below. This may hurt perfs on not-so-fancy CPUs. > >Will do. > >> >>> + dup v0.8h, w10 >>> + dup v1.8h, w11 >>> + ldr w12, [x5] // w12: by >>> + dup v2.8h, w12 >>> + >>> + mov w9, #256 // w9 = 1 << >>> (RGB2YUV_SHIFT - 7) >>> + movk w9, #8, lsl #16 // w9 += 32 << >>> (RGB2YUV_SHIFT - 1) >>> + dup v6.4s, w9 // w9: const_offset >>> + >>> + mov x2, #0 // w2: i >>> + and w3, w4, #0xFFFFFFF0 // w3 = width / 16 * 16 >>> + cbz w3, 3f >>> +1: >>> + rgb24_to_yuv_load_rgb x1 >>> + rgb24_to_yuv_product v19, v20, v21, v25, v26, v16, v0, v1, v2, #9 >>> + rgb24_to_yuv_product v22, v23, v24, v27, v28, v17, v0, v1, v2, #9 >>> + stp q16, q17, [x0], #32 // store to dst >>> + >>> + add w2, w2, #16 // i += 16 >>> + add x1, x1, #48 // src += 48 >>> + cmp w2, w3 // i < (width / 16 * 16) >>> + b.lt 1b >>> + b 3f >>> +2: >>> + ldrb w13, [x1] // w13: r >>> + ldrb w14, [x1, #1] // w14: g >>> + ldrb w15, [x1, #2] // w15: b >> >> You can reorder instructions a little to use post-index and eliminate the >> ADD, though that won't make much difference. >> >> I don't get why the perf gain is so low, or is this an artefact of Apple >> CPUs? > >I have checked the assembly of C version. The compiler has done pretty well on >loop unroll and >vectorize on this simple case.
Uh, don't we disable auto-vectorisation in the configure script? Until/unless it is re-enabled, I think benchmarks should be done against non-auto-vectorised code, if only to stay representative of normal/default FFmpeg builds. _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".