https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122152
--- Comment #6 from Robin Dapp <rdapp at gcc dot gnu.org> ---
(In reply to Lin Li from comment #4)
> Here is the performance data of 462.libquantum running on the SG2044 with
> GCC15.2.
>
>                mno-autovec-segment  zvl  mrvv-max-lmul  SG2044 1copy ratio
> base                                                    58.6
> noseg          √                                        58.9
>
> base-zvl                            √                   58.6
> noseg-zvl      √                    √                   49.8
> noseg-m2       √                         m2             61.6
> noseg-zvl-m1   √                    √    m1             50.0
> noseg-zvl-m2   √                    √    m2              8.9
> noseg-zvl-dyn  √                    √    dynamic        22.5
>
> > the whole compilation options are
> > '-march=rv64gcv_zba_zbb_zbc_zbs_zicond_zvl128b -mrvv-vector-bits=zvl
> > -mrvv-max-lmul=m2 -mno-autovec-segment'.
>
> The noseg-zvl-m2/dyn tests above are the scenarios I mentioned. The following
> is the assembly code of quantum_toffoli for noseg-zvl-m1 and noseg-zvl-m2.
>
> loc_12A30(noseg-zvl-m1):
> minu a5, a3, t4
> minu t1, a5, t3
> sub a5, a5, t1
> vsetvli zero, a5, e64, ta, ma
> vle64.v v2, (a7)
> vsetvli zero, t1, e64, ta, ma
> vle64.v v1, (a4)
> vsetivli zero, 10h, e8, ta, ma
> minu a5, a6, t3
> mv t1, a3
> addi a7, a7, 20h # ' '
> addi a3, a3, -4
> vmv1r.v v7, v1
> addi a6, a6, -2
> vslideup.vi v7, v2, 8
> vsetivli zero, 2, e32, mf2, ta, ma
> vnsrl.wx v5, v7, a0
> vnsrl.wx v4, v7, t5
> vnsrl.wx v2, v7, s0
> vnsrl.wx v3, v7, a2
> vxor.vv v4, v4, v5
> vnsrl.wx v0, v7, t0
> vxor.vv v3, v3, v2
> vnsrl.wx v2, v7, t2
> vxor.vv v0, v0, v4
> vsetvli zero, zero, e64, ta, ma
> vxor.vv v1, v6, v7
> vsetvli zero, zero, e32, mf2, ta, ma
> vxor.vv v2, v2, v3
> vand.vv v0, v0, v2
> vand.vi v0, v0, 1
> vmsne.vi v0, v0, 0
> vsetvli zero, a5, e64, ta, ma
> vsse64.v v1, (a4), a1, v0.t
> addi a4, a4, 20h # ' '
> bltu t4, t1, loc_12A30
>
> loc_142EE(noseg-zvl-m2):
> slli a1, a4, 1
> minu a5, a1, a7
> minu s8, a5, a0
> sub a5, a5, s8
> vsetvli zero, a5, e64, m2, ta, ma
> vle64.v v2, (a2)
> vsetvli zero, s8, e64, m2, ta, ma
> vle64.v v8, (a3)
> vmv1r.v v0, v6
> vsetivli zero, 4, e64, m2, ta, ma
> minu a5, a4, a0
> vcompress.vm v4, v2, v0
> addi a2, a2, 40h # '@'
> vcompress.vm v2, v8, v0
> addi a4, a4, -4
> vslideup.vi v2, v4, 2
> vsetvli zero, zero, e32, ta, ma
> vnsrl.wx v7, v2, s2
> vnsrl.wx v5, v2, s1
> vnsrl.wx v1, v2, t2
> vnsrl.wx v4, v2, t0
> vnsrl.wx v0, v2, s0
> vxor.vv v5, v5, v7
> vxor.vv v4, v4, v1
> vnsrl.wx v1, v2, t6
> vxor.vv v0, v0, v5
> vsetvli zero, zero, e64, m2, ta, ma
> vxor.vv v2, v10, v2
> vsetvli zero, zero, e32, ta, ma
> vxor.vv v1, v1, v4
> vand.vv v0, v0, v1
> vand.vi v0, v0, 1
> vmsne.vi v0, v0, 0
> vsetvli zero, a5, e64, m2, ta, ma
> vsse64.v v2, (a3), t5, v0.t
> addi a3, a3, 40h # '@'
> bltu a7, a1, loc_142EE
>
>
> vcompress is the main difference and it's indeed a single-cycle instruction.
>
> Referring to the reply from a colleague in the architecture department: the
> pmspc instructions can only use VX3 when LMUL > 1, and can use VX3/VX1 when
> LMUL = 1. More importantly, in that case the pmspc instructions are executed
> in non-pipelined mode, which is very inefficient because the WB needs to
> wait for the result of each MOP. (The segment load/store operations seem to
> be slow for similar reasons.)
>
> He suggested that the pmspc instructions (including slide, gather, and
> compress) should avoid LMUL > 1 scenarios as much as possible.
>
> I think the main problem is that the compiler does not offer an option to
> let us decide whether to generate the vcompress instruction, like
> '-mno-autovec-segment' does for segmented loads/stores.
Thank you for the detailed response; this helps me understand what you're facing.
Also, it makes sense to me: it's not unreasonable for uarchs to become
comparatively less efficient at LMUL > 1 when gathers (and related
instructions) are involved, as the complexity can grow quadratically.
The question is how to deal with it.
For the snippet in question (or libquantum) we currently have three choices:
(1) use segmented loads (VLA case)
(2) use strided loads
(3) use an emulated "interleave" (VLS case)
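For context, the pattern in question is the single-element interleave from
libquantum's quantum_toffoli. A minimal C sketch (field and type names
simplified from the original source) of the access pattern: each node packs
an amplitude next to the 64-bit basis state, so touching only `state` is a
stride-2 access over 64-bit elements, which is exactly what a strided
load/store handles, while a plain contiguous vector load sees interleaved
garbage that has to be de-interleaved.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified quantum_reg_node: only `state` is touched by the kernel,
   so the access is a single-element interleave (stride 16 bytes).  */
typedef struct
{
  double amplitude;   /* unused by the loop below */
  uint64_t state;     /* the only field the loop reads and writes */
} quantum_reg_node;

/* Sketch of the quantum_toffoli loop: flip bit `t` of each state whose
   bits `c1` and `c2` are both set.  */
static void
toffoli_kernel (quantum_reg_node *node, size_t size,
                unsigned c1, unsigned c2, unsigned t)
{
  for (size_t i = 0; i < size; i++)
    if (((node[i].state >> c1) & 1) && ((node[i].state >> c2) & 1))
      node[i].state ^= (uint64_t) 1 << t;
}
```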
On trunk we prefer (3) because a VLS loop can be unrolled; that's one
tie-breaker condition we use. That tie-breaker was already present in GCC 15,
though, so I wonder why it kicks in now.
As you already mentioned, (1) can be improved by using strided loads, and we
already do that in GCC 15 when segmented loads are not available.
What's still missing is using strided loads as a fallback even when segmented
loads are available. For single-element interleaving, strided loads should
always be preferable, no matter the segment size.
That's a vectorizer task.
Another, target-only, improvement would be to enable segmented loads for
VLS modes; currently they are VLA-only. That way we could benefit from
explicit unrolling, which would help loops with real (non-single-element)
segmented loads.
I don't see an easy way to help (3) directly. The fallback for the vcompress
interleave approach is two vrgathers (the second one masked). We need to
interleave the vectors somehow:
vect__4.12_74 = VEC_PERM_EXPR <vect__4.10_71, vect__4.11_73, { 0, 2, 4, 6 }>;
If gathers are also not available, we give up.
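As a scalar reference for what that permute computes (a sketch, not GCC
code): VEC_PERM_EXPR selects from the concatenation of its two input
vectors, so the selector { 0, 2, 4, 6 } on two 4-element inputs collects
the even lanes of both, i.e. it de-interleaves. This is the result the
vcompress sequence (or the two-vrgather fallback) has to produce.

```c
#include <stdint.h>

/* Scalar model of VEC_PERM_EXPR <a, b, { 0, 2, 4, 6 }> on two 4-element
   vectors: index into the concatenation { a[0..3], b[0..3] }.  */
static void
vec_perm_0246 (const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
  const uint32_t concat[8] = { a[0], a[1], a[2], a[3],
                               b[0], b[1], b[2], b[3] };
  const int sel[4] = { 0, 2, 4, 6 };

  for (int i = 0; i < 4; i++)
    out[i] = concat[sel[i]];   /* even lanes of a, then even lanes of b */
}
```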
So the only thing we can do (IMHO) is prevent this situation altogether via
target- and uarch-specific costing. For your uarch we'd increase the gather,
compress, ... costs significantly when the mode is LMUL > 1. The final costs
would then hopefully reject LMUL2 and choose LMUL1.
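A hypothetical sketch of such a penalty (the function name and the scaling
factor are assumptions for illustration, not actual GCC internals):
permute-class operations could be charged quadratically in LMUL, matching
the observation that their complexity can grow quadratically, so that the
final cost comparison favors LMUL1.

```c
/* Hypothetical costing helper (not GCC code): penalize permute-class
   ops (vrgather, vcompress, vslide*) when the mode uses LMUL > 1.
   The quadratic factor is an assumption modeling complexity growing
   with the square of the register-group size on such a uarch.  */
static int
permute_op_cost (int base_cost, int lmul)
{
  return lmul > 1 ? base_cost * lmul * lmul : base_cost;
}
```

With something like this, an LMUL2 loop body containing a vcompress would
cost roughly four times the LMUL1 equivalent, which should be enough for
the final comparison to pick LMUL1.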
I'll check the tie-breaker; that looks like a GCC 16 regression :/