PR #22385 opened by Zhanheng.Yang
URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/22385
Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/22385.patch
The original vle + vslideup approach to horizontal filtering creates a chain of
vector-register dependencies, which hurts performance. This patch uses multiple
vle loads instead.
Benchmarked on an A210 C908 core (VLEN 128):
                                        old opt            new opt
put_h264_qpel_4_mc10_8_c: 119.7 ( 1.00x) 126.3 ( 1.00x)
put_h264_qpel_4_mc10_8_rvv_i32: 102.0 ( 1.17x) 76.5 ( 1.65x)
put_h264_qpel_4_mc20_8_c: 100.7 ( 1.00x) 96.1 ( 1.00x)
put_h264_qpel_4_mc20_8_rvv_i32: 95.8 ( 1.05x) 70.6 ( 1.36x)
put_h264_qpel_4_mc30_8_c: 119.6 ( 1.00x) 123.6 ( 1.00x)
put_h264_qpel_4_mc30_8_rvv_i32: 101.6 ( 1.18x) 75.0 ( 1.65x)
put_h264_qpel_8_mc10_8_c: 484.9 ( 1.00x) 475.2 ( 1.00x)
put_h264_qpel_8_mc10_8_rvv_i32: 199.1 ( 2.44x) 144.6 ( 3.29x)
put_h264_qpel_8_mc20_8_c: 353.6 ( 1.00x) 360.4 ( 1.00x)
put_h264_qpel_8_mc20_8_rvv_i32: 187.2 ( 1.89x) 133.5 ( 2.70x)
put_h264_qpel_8_mc30_8_c: 477.4 ( 1.00x) 448.4 ( 1.00x)
put_h264_qpel_8_mc30_8_rvv_i32: 199.0 ( 2.40x) 144.7 ( 3.10x)
put_h264_qpel_16_mc10_8_c: 1908.4 ( 1.00x) 1961.3 ( 1.00x)
put_h264_qpel_16_mc10_8_rvv_i32: 432.8 ( 4.41x) 351.2 ( 5.59x)
put_h264_qpel_16_mc20_8_c: 1459.0 ( 1.00x) 1446.4 ( 1.00x)
put_h264_qpel_16_mc20_8_rvv_i32: 403.1 ( 3.62x) 320.6 ( 4.51x)
put_h264_qpel_16_mc30_8_c: 1935.7 ( 1.00x) 1916.8 ( 1.00x)
put_h264_qpel_16_mc30_8_rvv_i32: 435.0 ( 4.45x) 353.0 ( 5.43x)
avg_h264_qpel_4_mc10_8_c: 133.6 ( 1.00x) 129.2 ( 1.00x)
avg_h264_qpel_4_mc10_8_rvv_i32: 105.7 ( 1.26x) 80.0 ( 1.62x)
avg_h264_qpel_4_mc20_8_c: 114.7 ( 1.00x) 122.8 ( 1.00x)
avg_h264_qpel_4_mc20_8_rvv_i32: 99.5 ( 1.15x) 73.1 ( 1.68x)
avg_h264_qpel_4_mc30_8_c: 128.0 ( 1.00x) 127.9 ( 1.00x)
avg_h264_qpel_4_mc30_8_rvv_i32: 105.3 ( 1.22x) 79.6 ( 1.61x)
avg_h264_qpel_8_mc10_8_c: 505.2 ( 1.00x) 494.9 ( 1.00x)
avg_h264_qpel_8_mc10_8_rvv_i32: 207.2 ( 2.44x) 152.2 ( 3.25x)
avg_h264_qpel_8_mc20_8_c: 421.0 ( 1.00x) 422.4 ( 1.00x)
avg_h264_qpel_8_mc20_8_rvv_i32: 195.6 ( 2.15x) 140.3 ( 3.01x)
avg_h264_qpel_8_mc30_8_c: 479.6 ( 1.00x) 500.0 ( 1.00x)
avg_h264_qpel_8_mc30_8_rvv_i32: 208.8 ( 2.30x) 153.8 ( 3.25x)
avg_h264_qpel_16_mc10_8_c: 2006.0 ( 1.00x) 2011.4 ( 1.00x)
avg_h264_qpel_16_mc10_8_rvv_i32: 462.5 ( 4.34x) 378.9 ( 5.31x)
avg_h264_qpel_16_mc20_8_c: 1761.2 ( 1.00x) 1749.9 ( 1.00x)
avg_h264_qpel_16_mc20_8_rvv_i32: 431.7 ( 4.08x) 348.0 ( 5.03x)
avg_h264_qpel_16_mc30_8_c: 1950.7 ( 1.00x) 1980.4 ( 1.00x)
avg_h264_qpel_16_mc30_8_rvv_i32: 464.4 ( 4.20x) 380.1 ( 5.21x)
Signed-off-by: zhanheng.yang <[email protected]>
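For reference, the horizontal lowpass that the lowpass_h macro vectorizes is the
standard H.264 6-tap luma interpolation filter with coefficients (1, -5, 20, 20,
-5, 1); in the patched code t6 holds 20 and a7 holds -5. A minimal scalar sketch
(hypothetical helper, not part of the patch) of the unclipped per-pixel result:

```python
def lowpass_h(src, x):
    """Unclipped H.264 6-tap horizontal lowpass at position x.

    Taps span src[x-2] .. src[x+3], matching the six vle loads in the
    RVV macro (v26 = src-2 ... v31 = src+3).
    """
    return (src[x - 2] + src[x + 3]              # outer taps, coeff 1
            + 20 * (src[x] + src[x + 1])         # center taps (t6 = 20)
            - 5 * (src[x - 1] + src[x + 2]))     # inner taps (a7 = -5)
```

Because each of the six taps is an independent load, they carry no
register-to-register dependency, unlike the old vslide1up chain where each
slide consumed the previous slide's destination register.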
From c40dc90a3de5eb7d263a6e38a38e7d661d4a8800 Mon Sep 17 00:00:00 2001
From: "[email protected]" <[email protected]>
Date: Tue, 20 Jan 2026 19:10:23 +0800
Subject: [PATCH] libavcodec/riscv/h264qpel: use multi-vle instead of vle +
vslideup in macro lowpass_h.
The original vle + vslideup approach to horizontal filtering creates a chain of
vector-register dependencies, which hurts performance. This patch uses multiple
vle loads instead.
Benchmarked on an A210 C908 core (VLEN 128):
                                        old opt            new opt
put_h264_qpel_4_mc10_8_c: 119.7 ( 1.00x) 126.3 ( 1.00x)
put_h264_qpel_4_mc10_8_rvv_i32: 102.0 ( 1.17x) 76.5 ( 1.65x)
put_h264_qpel_4_mc20_8_c: 100.7 ( 1.00x) 96.1 ( 1.00x)
put_h264_qpel_4_mc20_8_rvv_i32: 95.8 ( 1.05x) 70.6 ( 1.36x)
put_h264_qpel_4_mc30_8_c: 119.6 ( 1.00x) 123.6 ( 1.00x)
put_h264_qpel_4_mc30_8_rvv_i32: 101.6 ( 1.18x) 75.0 ( 1.65x)
put_h264_qpel_8_mc10_8_c: 484.9 ( 1.00x) 475.2 ( 1.00x)
put_h264_qpel_8_mc10_8_rvv_i32: 199.1 ( 2.44x) 144.6 ( 3.29x)
put_h264_qpel_8_mc20_8_c: 353.6 ( 1.00x) 360.4 ( 1.00x)
put_h264_qpel_8_mc20_8_rvv_i32: 187.2 ( 1.89x) 133.5 ( 2.70x)
put_h264_qpel_8_mc30_8_c: 477.4 ( 1.00x) 448.4 ( 1.00x)
put_h264_qpel_8_mc30_8_rvv_i32: 199.0 ( 2.40x) 144.7 ( 3.10x)
put_h264_qpel_16_mc10_8_c: 1908.4 ( 1.00x) 1961.3 ( 1.00x)
put_h264_qpel_16_mc10_8_rvv_i32: 432.8 ( 4.41x) 351.2 ( 5.59x)
put_h264_qpel_16_mc20_8_c: 1459.0 ( 1.00x) 1446.4 ( 1.00x)
put_h264_qpel_16_mc20_8_rvv_i32: 403.1 ( 3.62x) 320.6 ( 4.51x)
put_h264_qpel_16_mc30_8_c: 1935.7 ( 1.00x) 1916.8 ( 1.00x)
put_h264_qpel_16_mc30_8_rvv_i32: 435.0 ( 4.45x) 353.0 ( 5.43x)
avg_h264_qpel_4_mc10_8_c: 133.6 ( 1.00x) 129.2 ( 1.00x)
avg_h264_qpel_4_mc10_8_rvv_i32: 105.7 ( 1.26x) 80.0 ( 1.62x)
avg_h264_qpel_4_mc20_8_c: 114.7 ( 1.00x) 122.8 ( 1.00x)
avg_h264_qpel_4_mc20_8_rvv_i32: 99.5 ( 1.15x) 73.1 ( 1.68x)
avg_h264_qpel_4_mc30_8_c: 128.0 ( 1.00x) 127.9 ( 1.00x)
avg_h264_qpel_4_mc30_8_rvv_i32: 105.3 ( 1.22x) 79.6 ( 1.61x)
avg_h264_qpel_8_mc10_8_c: 505.2 ( 1.00x) 494.9 ( 1.00x)
avg_h264_qpel_8_mc10_8_rvv_i32: 207.2 ( 2.44x) 152.2 ( 3.25x)
avg_h264_qpel_8_mc20_8_c: 421.0 ( 1.00x) 422.4 ( 1.00x)
avg_h264_qpel_8_mc20_8_rvv_i32: 195.6 ( 2.15x) 140.3 ( 3.01x)
avg_h264_qpel_8_mc30_8_c: 479.6 ( 1.00x) 500.0 ( 1.00x)
avg_h264_qpel_8_mc30_8_rvv_i32: 208.8 ( 2.30x) 153.8 ( 3.25x)
avg_h264_qpel_16_mc10_8_c: 2006.0 ( 1.00x) 2011.4 ( 1.00x)
avg_h264_qpel_16_mc10_8_rvv_i32: 462.5 ( 4.34x) 378.9 ( 5.31x)
avg_h264_qpel_16_mc20_8_c: 1761.2 ( 1.00x) 1749.9 ( 1.00x)
avg_h264_qpel_16_mc20_8_rvv_i32: 431.7 ( 4.08x) 348.0 ( 5.03x)
avg_h264_qpel_16_mc30_8_c: 1950.7 ( 1.00x) 1980.4 ( 1.00x)
avg_h264_qpel_16_mc30_8_rvv_i32: 464.4 ( 4.20x) 380.1 ( 5.21x)
Signed-off-by: zhanheng.yang <[email protected]>
---
libavcodec/riscv/h264qpel_rvv.S | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/libavcodec/riscv/h264qpel_rvv.S b/libavcodec/riscv/h264qpel_rvv.S
index df6796748f..1b3737d1a6 100644
--- a/libavcodec/riscv/h264qpel_rvv.S
+++ b/libavcodec/riscv/h264qpel_rvv.S
@@ -2,6 +2,7 @@
* SPDX-License-Identifier: BSD-2-Clause
*
* Copyright (c) 2024 Niklas Haas
+ * Copyright (C) 2026 Alibaba Group Holding Limited
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
@@ -53,20 +54,19 @@
/* output is unclipped; clobbers v26-v31 plus t0 and t02 */
.macro lowpass_h vdst, src
- addi t4, \src, 3
- lbu t5, 2(\src)
- vle8.v v31, (t4)
- lbu t4, 1(\src)
- vslide1up.vx v30, v31, t5
- lbu t5, 0(\src)
- vslide1up.vx v29, v30, t4
- lbu t4, -1(\src)
- vslide1up.vx v28, v29, t5
- lbu t5, -2(\src)
- vslide1up.vx v27, v28, t4
- vslide1up.vx v26, v27, t5
+ addi t4, \src, -2
+ vle8.v v26, (t4)
+ addi t5, \src, 3
+ vle8.v v31, (t5)
+ addi t4, \src, 1
+ vle8.v v28, (\src)
vwaddu.vv \vdst, v26, v31
+ vle8.v v29, (t4)
vwmaccu.vx \vdst, t6, v28
+ addi t4, \src, -1
+ vle8.v v27, (t4)
+ addi t5, \src, 2
+ vle8.v v30, (t5)
vwmaccu.vx \vdst, t6, v29
vwmaccsu.vx \vdst, a7, v27
vwmaccsu.vx \vdst, a7, v30
--
2.52.0