https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119911
Bug ID: 119911
Summary: [RVV] Suboptimal code generation when extracting the
0-th element of a vector multiple times
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: wojciech_mula at poczta dot onet.pl
Target Milestone: ---
I observed the issue on GCC 14.2, but it's still visible on the godbolt trunk,
which is 16.0.0 20250423 (experimental).
Summary: when a program extracts the 0-th vector element into a scalar
register with `vmv.x.s` several times, GCC always emits a shift-left followed
by a shift-right to zero-extend each result from the element width (8 or 16
bits). However, when there are several such extractions, it would be
profitable to materialize the mask (a compile-time constant) in a register
once and use a single bitwise AND per extraction.
Clang performs this optimization.
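For a single 16-bit element the two idioms compare as follows (hand-written
illustration, not compiler output; register choices are arbitrary):
---
        # zero-extension via a shift pair, as GCC emits per extracted element
        slli    a0,a0,48
        srli    a0,a0,48

        # zero-extension via a mask kept in a register; the `li` pseudo
        # expands to two instructions, but that cost is paid only once
        li      a1,0xffff
        and     a0,a0,a1
---
Counting instructions only, the shift-pair idiom costs 2 per extraction (2n in
total), while the mask idiom costs 2 + n: the two break even at two
extractions and the mask wins from three onwards.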
Consider this simple function:
---test.cpp---
#include <riscv_vector.h>
#include <cstdint>
uint64_t sum_of_first_three(vuint16m1_t x) {
    const uint64_t mask = 0xffff;
    const auto vl = __riscv_vsetvlmax_e16m1();
    return uint64_t(__riscv_vmv_x_s_u16m1_u16(x))
         + uint64_t(__riscv_vmv_x_s_u16m1_u16(__riscv_vslidedown(x, 1, vl)))
         + uint64_t(__riscv_vmv_x_s_u16m1_u16(__riscv_vslidedown(x, 2, vl)));
}
---eof---
When compiled with `-O3 -march=rv64gcv`, the assembly is:
---
sum_of_first_three(__rvv_uint16m1_t):
        vsetvli a5,zero,e16,m1,ta,ma
        vslidedown.vi   v10,v8,1
        vslidedown.vi   v9,v8,2
        vmv.x.s a5,v8
        vmv.x.s a4,v10
        vmv.x.s a0,v9
        slli    a4,a4,48
        slli    a5,a5,48
        srli    a4,a4,48
        srli    a5,a5,48
        slli    a0,a0,48
        add     a5,a5,a4
        srli    a0,a0,48
        add     a0,a5,a0
        ret
---
godbolt link: https://godbolt.org/z/hPrM8vz4v
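For comparison, a hand-written sketch of the code shape I would expect with
the mask materialized once (an illustration of the idea, not verbatim Clang
output; register allocation and scheduling are arbitrary):
---
sum_of_first_three(__rvv_uint16m1_t):
        vsetvli a5,zero,e16,m1,ta,ma
        vslidedown.vi   v10,v8,1
        vslidedown.vi   v9,v8,2
        vmv.x.s a5,v8
        vmv.x.s a4,v10
        vmv.x.s a0,v9
        lui     a3,16           # a3 = 0x10000
        addi    a3,a3,-1        # a3 = 0xffff, built once
        and     a5,a5,a3
        and     a4,a4,a3
        and     a0,a0,a3
        add     a5,a5,a4
        add     a0,a5,a0
        ret
---
That is one instruction fewer than the current output for three extractions,
and the gap grows by one instruction for every additional extracted element.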