https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121051
            Bug ID: 121051
           Summary: RVV intrinsics: unnecessary spilling and bad register
                    allocation
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: camel-cdr at protonmail dot com
  Target Milestone: ---

GCC seems to have a lot of problems with register allocation of RVV intrinsics
for slightly more complicated code.

I've noticed for quite some time now that GCC's codegen for RVV intrinsics is
often a lot worse than LLVM's, because it generates redundant spills and moves.
I now have a good code example to illustrate the problem:

vuint16m8_t trans8x8_vslide(vuint16m8_t v) {
    size_t VL = __riscv_vsetvlmax_e64m4();

    vbool16_t modd = __riscv_vreinterpret_b16(
            __riscv_vmv_v_x_u8m1(0b10101010, __riscv_vsetvlmax_e8m1()));
    vbool16_t meven = __riscv_vmnot(modd, VL);
    vbool16_t m;

    vuint64m4_t v4l = __riscv_vreinterpret_u64m4(__riscv_vget_u16m4(v, 0));
    vuint64m4_t v4h = __riscv_vreinterpret_u64m4(__riscv_vget_u16m4(v, 1));
    vuint64m4_t v4lt = v4l;
    m = modd;
    v4l = __riscv_vslide1up_mu(m, v4l, v4h, 0, VL);
    m = meven;
    v4h = __riscv_vslide1down_mu(m, v4h, v4lt, 0, VL);

    vuint32m2_t v2ll = __riscv_vreinterpret_u32m2(__riscv_vget_u64m2(v4l, 0));
    vuint32m2_t v2lh = __riscv_vreinterpret_u32m2(__riscv_vget_u64m2(v4l, 1));
    vuint32m2_t v2hl = __riscv_vreinterpret_u32m2(__riscv_vget_u64m2(v4h, 0));
    vuint32m2_t v2hh = __riscv_vreinterpret_u32m2(__riscv_vget_u64m2(v4h, 1));
    vuint32m2_t v2llt = v2lh, v2hlt = v2hh;
    v2lh = __riscv_vslide1down_mu(m, v2lh, v2ll, 0, VL);
    v2hh = __riscv_vslide1down_mu(m, v2hh, v2hl, 0, VL);
    m = modd;
    v2ll = __riscv_vslide1up_mu(m, v2ll, v2llt, 0, VL);
    v2hl = __riscv_vslide1up_mu(m, v2hl, v2hlt, 0, VL);

    vuint16m1_t v1lll = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2ll, 0));
    vuint16m1_t v1llh = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2ll, 1));
    vuint16m1_t v1lhl = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2lh, 0));
    vuint16m1_t v1lhh = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2lh, 1));
    vuint16m1_t v1hll = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2hl, 0));
    vuint16m1_t v1hlh = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2hl, 1));
    vuint16m1_t v1hhl = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2hh, 0));
    vuint16m1_t v1hhh = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2hh, 1));
    vuint16m1_t v1lllt = v1lll, v1lhlt = v1lhl, v1hllt = v1hll, v1hhlt = v1hhl;
    v1lll = __riscv_vslide1up_mu(m, v1lll, v1llh, 0, VL);
    v1lhl = __riscv_vslide1up_mu(m, v1lhl, v1lhh, 0, VL);
    v1hll = __riscv_vslide1up_mu(m, v1hll, v1hlh, 0, VL);
    v1hhl = __riscv_vslide1up_mu(m, v1hhl, v1hhh, 0, VL);
    m = meven;
    v1llh = __riscv_vslide1down_mu(m, v1llh, v1lllt, 0, VL);
    v1lhh = __riscv_vslide1down_mu(m, v1lhh, v1lhlt, 0, VL);
    v1hlh = __riscv_vslide1down_mu(m, v1hlh, v1hllt, 0, VL);
    v1hhh = __riscv_vslide1down_mu(m, v1hhh, v1hhlt, 0, VL);

    return __riscv_vcreate_v_u16m1_u16m8(
            v1lll, v1llh, v1lhl, v1lhh, v1hll, v1hlh, v1hhl, v1hhh);
}

The above code is a port of an assembly function that transposes 8x8 matrices
stored across 8 vector registers:
https://github.com/camel-cdr/rvv-bench/blob/65a8dec6f238a45f22b26d02638471bfe68461c5/bench/trans8x8.S.inc#L291

This currently seems to be the best method for implementing such a transpose,
but GCC fails to produce acceptable assembly code for it:
https://godbolt.org/z/KM7ejn9f9

GCC spills 12 vector registers, while the assembly version doesn't spill
anything and only clobbers 12 vector registers in total. There are also a lot
of redundant vector moves, and overall more than 3x as many instructions were
generated (117 vs 32).
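
In case a smaller test case is useful, below is a reduced sketch of the same
masked slide1up/slide1down pattern on a single pair of m1 registers. Note that
I haven't verified this small version triggers the same spilling; it just
exercises the same tied-destination _mu intrinsics as the full transpose:

#include <riscv_vector.h>

vuint16m2_t swap_pairs(vuint16m2_t v) {
    size_t VL = __riscv_vsetvlmax_e16m1();
    vbool16_t modd = __riscv_vreinterpret_b16(
            __riscv_vmv_v_x_u8m1(0b10101010, __riscv_vsetvlmax_e8m1()));
    vbool16_t meven = __riscv_vmnot(modd, VL);
    vuint16m1_t x = __riscv_vget_u16m1(v, 0);
    vuint16m1_t y = __riscv_vget_u16m1(v, 1);
    vuint16m1_t xt = x;
    /* The second slide needs the old value of x, so x has to stay live
       across the first slide; with the _mu policy the destination also
       provides the inactive elements, so it is tied as well. */
    x = __riscv_vslide1up_mu(modd, x, y, 0, VL);
    y = __riscv_vslide1down_mu(meven, y, xt, 0, VL);
    return __riscv_vcreate_v_u16m1_u16m2(x, y);
}

The _mu policy keeps the inactive destination elements, so both the slide
source and the original destination value must be live at the same time; in
the full 8x8 version that is where the register pressure comes from, but the
whole working set should still fit in the 32 vector registers without spilling.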