https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122152
--- Comment #6 from Robin Dapp <rdapp at gcc dot gnu.org> ---
(In reply to Lin Li from comment #4)
> Here is the performance data of 462.libquantum running on the SG2044 with
> GCC15.2.
>
>                mno-autovec-segment  zvl  mrvv-max-lmul  SG2044 1copy ratio
> base                                                    58.6
> noseg          √                                        58.9
>
> base-zvl                            √                   58.6
> noseg-zvl      √                    √                   49.8
> noseg-m2       √                         m2             61.6
> noseg-zvl-m1   √                    √    m1             50.0
> noseg-zvl-m2   √                    √    m2              8.9
> noseg-zvl-dyn  √                    √    dynamic        22.5
>
> > the whole compilation options are
> > '-march=rv64gcv_zba_zbb_zbc_zbs_zicond_zvl128b -mrvv-vector-bits=zvl
> > -mrvv-max-lmul=m2 -mno-autovec-segment'.
>
> The noseg-zvl-m2/dyn tests above are the scenarios I mentioned. The following
> is the assembly code of quantum_toffoli for noseg-zvl-m1 and noseg-zvl-m2.
>
> loc_12A30(noseg-zvl-m1):
> minu a5, a3, t4
> minu t1, a5, t3
> sub a5, a5, t1
> vsetvli zero, a5, e64, ta, ma
> vle64.v v2, (a7)
> vsetvli zero, t1, e64, ta, ma
> vle64.v v1, (a4)
> vsetivli zero, 10h, e8, ta, ma
> minu a5, a6, t3
> mv t1, a3
> addi a7, a7, 20h # ' '
> addi a3, a3, -4
> vmv1r.v v7, v1
> addi a6, a6, -2
> vslideup.vi v7, v2, 8
> vsetivli zero, 2, e32, mf2, ta, ma
> vnsrl.wx v5, v7, a0
> vnsrl.wx v4, v7, t5
> vnsrl.wx v2, v7, s0
> vnsrl.wx v3, v7, a2
> vxor.vv v4, v4, v5
> vnsrl.wx v0, v7, t0
> vxor.vv v3, v3, v2
> vnsrl.wx v2, v7, t2
> vxor.vv v0, v0, v4
> vsetvli zero, zero, e64, ta, ma
> vxor.vv v1, v6, v7
> vsetvli zero, zero, e32, mf2, ta, ma
> vxor.vv v2, v2, v3
> vand.vv v0, v0, v2
> vand.vi v0, v0, 1
> vmsne.vi v0, v0, 0
> vsetvli zero, a5, e64, ta, ma
> vsse64.v v1, (a4), a1, v0.t
> addi a4, a4, 20h # ' '
> bltu t4, t1, loc_12A30
>
> loc_142EE(noseg-zvl-m2):
> slli a1, a4, 1
> minu a5, a1, a7
> minu s8, a5, a0
> sub a5, a5, s8
> vsetvli zero, a5, e64, m2, ta, ma
> vle64.v v2, (a2)
> vsetvli zero, s8, e64, m2, ta, ma
> vle64.v v8, (a3)
> vmv1r.v v0, v6
> vsetivli zero, 4, e64, m2, ta, ma
> minu a5, a4, a0
> vcompress.vm v4, v2, v0
> addi a2, a2, 40h # '@'
> vcompress.vm v2, v8, v0
> addi a4, a4, -4
> vslideup.vi v2, v4, 2
> vsetvli zero, zero, e32, ta, ma
> vnsrl.wx v7, v2, s2
> vnsrl.wx v5, v2, s1
> vnsrl.wx v1, v2, t2
> vnsrl.wx v4, v2, t0
> vnsrl.wx v0, v2, s0
> vxor.vv v5, v5, v7
> vxor.vv v4, v4, v1
> vnsrl.wx v1, v2, t6
> vxor.vv v0, v0, v5
> vsetvli zero, zero, e64, m2, ta, ma
> vxor.vv v2, v10, v2
> vsetvli zero, zero, e32, ta, ma
> vxor.vv v1, v1, v4
> vand.vv v0, v0, v1
> vand.vi v0, v0, 1
> vmsne.vi v0, v0, 0
> vsetvli zero, a5, e64, m2, ta, ma
> vsse64.v v2, (a3), t5, v0.t
> addi a3, a3, 40h # '@'
> bltu a7, a1, loc_142EE
>
>
> vcompress is the main difference and it's indeed a single-cycle instruction.
>
> Referring to the reply from a colleague in the architecture department: the
> pmspc instructions can only use VX3 when LMUL > 1, and can use VX3/VX1 when
> LMUL = 1. More importantly, in that case the pmspc instructions are executed
> in non-pipelined mode, which is very inefficient because the WB needs to
> wait for the result of each MOP. (The segment load/store operations seem to
> be slow for similar reasons.)
>
> He suggested that the pmspc instructions (including slide, gather, and
> compress) should avoid LMUL > 1 scenarios as much as possible.
>
> I think the main problem is that the compiler does not offer an option to
> let us decide whether to generate the vcompress instruction, like
> '-mno-autovec-segment' does for segmented loads/stores.
Thank you for the detailed response; this helps me understand what you're facing.
Also, it makes sense to me: it's not unreasonable for uarchs to become
comparatively less efficient at LMUL > 1 when gathers (and related
instructions) are involved, as the complexity can grow quadratically.
The question is how to deal with it.
For the snippet in question (or libquantum) we currently have three choices:
(1) use segmented loads (VLA case)
(2) use strided loads
(3) use an emulated "interleave" (VLS case)
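For context, the pattern in question is the single-element interleave from
libquantum's quantum_toffoli. A minimal C sketch (field and type names
simplified from the original source) of the access pattern: each node packs
an amplitude next to the 64-bit basis state, so touching only `state` is a
stride-2 access over 64-bit elements, which is exactly what a strided
load/store handles, while a plain contiguous vector load sees interleaved
garbage that has to be de-interleaved.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified quantum_reg_node: only `state` is touched by the kernel,
   so the access is a single-element interleave (stride 16 bytes).  */
typedef struct
{
  double amplitude;   /* unused by the loop below */
  uint64_t state;     /* the only field the loop reads and writes */
} quantum_reg_node;

/* Sketch of the quantum_toffoli loop: flip bit `t` of each state whose
   bits `c1` and `c2` are both set.  */
static void
toffoli_kernel (quantum_reg_node *node, size_t size,
                unsigned c1, unsigned c2, unsigned t)
{
  for (size_t i = 0; i < size; i++)
    if (((node[i].state >> c1) & 1) && ((node[i].state >> c2) & 1))
      node[i].state ^= (uint64_t) 1 << t;
}
```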
On trunk we prefer (3) because a VLS loop can be unrolled; that's one
tie-breaker condition we use. That tie-breaker was already present in GCC 15,
though, so I wonder why it kicks in now.
As you already mentioned, (1) can be improved by using strided loads, and we
already do that in GCC 15 when segmented loads are not available.
What's still missing is using strided loads as a fallback even when segmented
loads are available. For single-element interleaving, strided loads should
always be preferable, no matter the segment size.
That's a vectorizer task.
Another, target-only, improvement would be to enable segmented loads for
VLS modes; currently they are VLA-only. That way we could benefit from
explicit unrolling, which would help loops with real (non-single-element)
segmented loads.
I don't see an easy way to help (3) directly. The fallback for the vcompress
interleave approach is two vrgathers (the second one masked). We need to
interleave the vectors somehow:
vect__4.12_74 = VEC_PERM_EXPR <vect__4.10_71, vect__4.11_73, { 0, 2, 4, 6 }>;
If gathers are also not available, we give up.
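As a scalar reference for what that permute computes (a sketch, not GCC
code): VEC_PERM_EXPR selects from the concatenation of its two input
vectors, so the selector { 0, 2, 4, 6 } on two 4-element inputs collects
the even lanes of both, i.e. it de-interleaves. This is the result the
vcompress sequence (or the two-vrgather fallback) has to produce.

```c
#include <stdint.h>

/* Scalar model of VEC_PERM_EXPR <a, b, { 0, 2, 4, 6 }> on two 4-element
   vectors: index into the concatenation { a[0..3], b[0..3] }.  */
static void
vec_perm_0246 (const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
  const uint32_t concat[8] = { a[0], a[1], a[2], a[3],
                               b[0], b[1], b[2], b[3] };
  const int sel[4] = { 0, 2, 4, 6 };

  for (int i = 0; i < 4; i++)
    out[i] = concat[sel[i]];   /* even lanes of a, then even lanes of b */
}
```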
So the only thing we can do (IMHO) is prevent this situation altogether via
target- and uarch-specific costing. For your uarch we'd increase the gather,
compress, ... costs significantly when the mode is LMUL > 1. The final costs
would then hopefully reject LMUL2 and choose LMUL1.
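A hypothetical sketch of such a penalty (the function name and the scaling
factor are assumptions for illustration, not actual GCC internals):
permute-class operations could be charged quadratically in LMUL, matching
the observation that their complexity can grow quadratically, so that the
final cost comparison favors LMUL1.

```c
/* Hypothetical costing helper (not GCC code): penalize permute-class
   ops (vrgather, vcompress, vslide*) when the mode uses LMUL > 1.
   The quadratic factor is an assumption modeling complexity growing
   with the square of the register-group size on such a uarch.  */
static int
permute_op_cost (int base_cost, int lmul)
{
  return lmul > 1 ? base_cost * lmul * lmul : base_cost;
}
```

With something like this, an LMUL2 loop body containing a vcompress would
cost roughly four times the LMUL1 equivalent, which should be enough for
the final comparison to pick LMUL1.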
I'll check the tie-breaker; that looks like a GCC 16 regression :/