From: Juzhe-Zhong <juzhe.zh...@rivai.ai>
According to RVV ISA:
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc
We can enhance VLA SLP auto-vectorization with (16.5.1. Synthesizing
vdecompress)
Decompress operation.
Case 1 (nunits = POLY_INT_CST [16, 16]):
_48 = VEC_PERM_EXPR <_37, _35, { 0, POLY_INT_CST [16, 16], 1, POLY_INT_CST [17,
16], 2, POLY_INT_CST [18, 16], ... }>;
We can optimize such VLA SLP permuation pattern into:
_48 = vdecompress (_37, _35, mask = { 0, 1, 0, 1, ... };
Case 2 (nunits = POLY_INT_CST [16, 16]):
_23 = VEC_PERM_EXPR <_46, _44, { POLY_INT_CST [1, 1], POLY_INT_CST [3, 3],
POLY_INT_CST [2, 1], POLY_INT_CST [4, 3], POLY_INT_CST [3, 1], POLY_INT_CST [5, 3],
... }>;
We can optimize such VLA SLP permuation pattern into:
_48 = vdecompress (slidedown(_46, 1/2 nunits), slidedown(_44, 1/2 nunits), mask
= { 0, 1, 0, 1, ... };
For example:
void __attribute__ ((noinline, noclone))
vec_slp (uint64_t *restrict a, uint64_t b, uint64_t c, int n)
{
for (int i = 0; i < n; ++i)
{
a[i * 2] += b;
a[i * 2 + 1] += c;
}
}
ASM:
...
vid.v v0
vand.vi v0,v0,1
vmseq.vi v0,v0,1 ===> mask = { 0, 1, 0, 1, ... }
vdecompress:
viota.m v3,v0
vrgather.vv v2,v1,v3,v0.t
Loop:
vsetvli zero,a5,e64,m1,ta,ma
vle64.v v1,0(a0)
vsetvli a6,zero,e64,m1,ta,ma
vadd.vv v1,v2,v1
vsetvli zero,a5,e64,m1,ta,ma
mv a5,a3
vse64.v v1,0(a0)
add a3,a3,a1
add a0,a0,a2
bgtu a5,a4,.L4
gcc/ChangeLog:
* config/riscv/riscv-v.cc (emit_vlmax_decompress_insn): New function.
(shuffle_decompress_patterns): New function.
(expand_vec_perm_const_1): Add decompress optimization.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/partial/slp-8.c: New test.
* gcc.target/riscv/rvv/autovec/partial/slp-9.c: New test.
* gcc.target/riscv/rvv/autovec/partial/slp_run-8.c: New test.
* gcc.target/riscv/rvv/autovec/partial/slp_run-9.c: New test.