https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121051
            Bug ID: 121051
           Summary: RVV intrinsics: unnecessary spilling and bad register
                    allocation
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: camel-cdr at protonmail dot com
  Target Milestone: ---

GCC seems to have a lot of problems with register allocation of RVV intrinsics
for slightly more complicated code.

I've noticed for quite some time now that GCC's codegen for RVV intrinsics is
often a lot worse than LLVM's, because it generates redundant spills and moves.
I now have a good code example to illustrate the problem:

vuint16m8_t trans8x8_vslide(vuint16m8_t v) {
    size_t VL = __riscv_vsetvlmax_e64m4();

    vbool16_t modd = __riscv_vreinterpret_b16(
            __riscv_vmv_v_x_u8m1(0b10101010, __riscv_vsetvlmax_e8m1()));
    vbool16_t meven = __riscv_vmnot(modd, VL);
    vbool16_t m;

    vuint64m4_t v4l = __riscv_vreinterpret_u64m4(__riscv_vget_u16m4(v, 0));
    vuint64m4_t v4h = __riscv_vreinterpret_u64m4(__riscv_vget_u16m4(v, 1));
    vuint64m4_t v4lt = v4l;
    m = modd;
    v4l = __riscv_vslide1up_mu(m, v4l, v4h, 0, VL);
    m = meven;
    v4h = __riscv_vslide1down_mu(m, v4h, v4lt, 0, VL);

    vuint32m2_t v2ll = __riscv_vreinterpret_u32m2(__riscv_vget_u64m2(v4l, 0));
    vuint32m2_t v2lh = __riscv_vreinterpret_u32m2(__riscv_vget_u64m2(v4l, 1));
    vuint32m2_t v2hl = __riscv_vreinterpret_u32m2(__riscv_vget_u64m2(v4h, 0));
    vuint32m2_t v2hh = __riscv_vreinterpret_u32m2(__riscv_vget_u64m2(v4h, 1));
    vuint32m2_t v2llt = v2lh, v2hlt = v2hh;
    v2lh = __riscv_vslide1down_mu(m, v2lh, v2ll, 0, VL);
    v2hh = __riscv_vslide1down_mu(m, v2hh, v2hl, 0, VL);
    m = modd;
    v2ll = __riscv_vslide1up_mu(m, v2ll, v2llt, 0, VL);
    v2hl = __riscv_vslide1up_mu(m, v2hl, v2hlt, 0, VL);

    vuint16m1_t v1lll = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2ll, 0));
    vuint16m1_t v1llh = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2ll, 1));
    vuint16m1_t v1lhl = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2lh, 0));
    vuint16m1_t v1lhh = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2lh, 1));
    vuint16m1_t v1hll = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2hl, 0));
    vuint16m1_t v1hlh = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2hl, 1));
    vuint16m1_t v1hhl = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2hh, 0));
    vuint16m1_t v1hhh = __riscv_vreinterpret_u16m1(__riscv_vget_u32m1(v2hh, 1));
    vuint16m1_t v1lllt = v1lll, v1lhlt = v1lhl, v1hllt = v1hll, v1hhlt = v1hhl;
    v1lll = __riscv_vslide1up_mu(m, v1lll, v1llh, 0, VL);
    v1lhl = __riscv_vslide1up_mu(m, v1lhl, v1lhh, 0, VL);
    v1hll = __riscv_vslide1up_mu(m, v1hll, v1hlh, 0, VL);
    v1hhl = __riscv_vslide1up_mu(m, v1hhl, v1hhh, 0, VL);
    m = meven;
    v1llh = __riscv_vslide1down_mu(m, v1llh, v1lllt, 0, VL);
    v1lhh = __riscv_vslide1down_mu(m, v1lhh, v1lhlt, 0, VL);
    v1hlh = __riscv_vslide1down_mu(m, v1hlh, v1hllt, 0, VL);
    v1hhh = __riscv_vslide1down_mu(m, v1hhh, v1hhlt, 0, VL);

    return __riscv_vcreate_v_u16m1_u16m8(
            v1lll, v1llh, v1lhl, v1lhh, v1hll, v1hlh, v1hhl, v1hhh);
}

The above code is a port of an assembly function that transposes 8x8 matrices
stored across 8 vector registers:
https://github.com/camel-cdr/rvv-bench/blob/65a8dec6f238a45f22b26d02638471bfe68461c5/bench/trans8x8.S.inc#L291

This currently seems to be the best method for implementing such a transpose,
but GCC fails to produce acceptable assembly code for it:
https://godbolt.org/z/KM7ejn9f9

GCC spills 12 vector registers, while the assembly version doesn't spill
anything and only clobbers 12 vector registers in total. There are also a lot
of redundant vector moves, and overall more than 3x as many instructions were
generated (117 vs 32).
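
In case a smaller test case is useful, below is a reduced sketch of the same
masked slide1up/slide1down pattern on a single pair of m1 registers. Note that
I haven't verified this small version triggers the same spilling; it just
exercises the same tied-destination _mu intrinsics as the full transpose:

#include <riscv_vector.h>

vuint16m2_t swap_pairs(vuint16m2_t v) {
    size_t VL = __riscv_vsetvlmax_e16m1();
    vbool16_t modd = __riscv_vreinterpret_b16(
            __riscv_vmv_v_x_u8m1(0b10101010, __riscv_vsetvlmax_e8m1()));
    vbool16_t meven = __riscv_vmnot(modd, VL);
    vuint16m1_t x = __riscv_vget_u16m1(v, 0);
    vuint16m1_t y = __riscv_vget_u16m1(v, 1);
    vuint16m1_t xt = x;
    /* The second slide needs the old value of x, so x has to stay live
       across the first slide; with the _mu policy the destination also
       provides the inactive elements, so it is tied as well. */
    x = __riscv_vslide1up_mu(modd, x, y, 0, VL);
    y = __riscv_vslide1down_mu(meven, y, xt, 0, VL);
    return __riscv_vcreate_v_u16m1_u16m2(x, y);
}

The _mu policy keeps the inactive destination elements, so both the slide
source and the original destination value must be live at the same time; in
the full 8x8 version that is where the register pressure comes from, but the
whole working set should still fit in the 32 vector registers without spilling.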