https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122152
--- Comment #4 from Lin Li <lilin at masscore dot cn> ---
Here is the performance data for 462.libquantum running on the SG2044 with
GCC 15.2.
config          -mno-autovec-segment  zvl  mrvv-max-lmul  SG2044 1copy ratio
base                                                      58.6
noseg           √                                         58.9
base-zvl                              √                   58.6
noseg-zvl       √                     √                   49.8
noseg-m2        √                          m2             61.6
noseg-zvl-m1    √                     √    m1             50.0
noseg-zvl-m2    √                     √    m2             8.9
noseg-zvl-dyn   √                     √    dynamic        22.5
> the whole compilation options are
> '-march=rv64gcv_zba_zbb_zbc_zbs_zicond_zvl128b -mrvv-vector-bits=zvl
> -mrvv-max-lmul=m2 -mno-autovec-segment'.
The noseg-zvl-m2/dyn rows above are the scenarios I mentioned. The following is
the assembly code of quantum_toffoli for noseg-zvl-m1 and noseg-zvl-m2.
loc_12A30 (noseg-zvl-m1):
minu a5, a3, t4
minu t1, a5, t3
sub a5, a5, t1
vsetvli zero, a5, e64, ta, ma
vle64.v v2, (a7)
vsetvli zero, t1, e64, ta, ma
vle64.v v1, (a4)
vsetivli zero, 16, e8, ta, ma
minu a5, a6, t3
mv t1, a3
addi a7, a7, 32
addi a3, a3, -4
vmv1r.v v7, v1
addi a6, a6, -2
vslideup.vi v7, v2, 8
vsetivli zero, 2, e32, mf2, ta, ma
vnsrl.wx v5, v7, a0
vnsrl.wx v4, v7, t5
vnsrl.wx v2, v7, s0
vnsrl.wx v3, v7, a2
vxor.vv v4, v4, v5
vnsrl.wx v0, v7, t0
vxor.vv v3, v3, v2
vnsrl.wx v2, v7, t2
vxor.vv v0, v0, v4
vsetvli zero, zero, e64, ta, ma
vxor.vv v1, v6, v7
vsetvli zero, zero, e32, mf2, ta, ma
vxor.vv v2, v2, v3
vand.vv v0, v0, v2
vand.vi v0, v0, 1
vmsne.vi v0, v0, 0
vsetvli zero, a5, e64, ta, ma
vsse64.v v1, (a4), a1, v0.t
addi a4, a4, 32
bltu t4, t1, loc_12A30
loc_142EE (noseg-zvl-m2):
slli a1, a4, 1
minu a5, a1, a7
minu s8, a5, a0
sub a5, a5, s8
vsetvli zero, a5, e64, m2, ta, ma
vle64.v v2, (a2)
vsetvli zero, s8, e64, m2, ta, ma
vle64.v v8, (a3)
vmv1r.v v0, v6
vsetivli zero, 4, e64, m2, ta, ma
minu a5, a4, a0
vcompress.vm v4, v2, v0
addi a2, a2, 64
vcompress.vm v2, v8, v0
addi a4, a4, -4
vslideup.vi v2, v4, 2
vsetvli zero, zero, e32, ta, ma
vnsrl.wx v7, v2, s2
vnsrl.wx v5, v2, s1
vnsrl.wx v1, v2, t2
vnsrl.wx v4, v2, t0
vnsrl.wx v0, v2, s0
vxor.vv v5, v5, v7
vxor.vv v4, v4, v1
vnsrl.wx v1, v2, t6
vxor.vv v0, v0, v5
vsetvli zero, zero, e64, m2, ta, ma
vxor.vv v2, v10, v2
vsetvli zero, zero, e32, ta, ma
vxor.vv v1, v1, v4
vand.vv v0, v0, v1
vand.vi v0, v0, 1
vmsne.vi v0, v0, 0
vsetvli zero, a5, e64, m2, ta, ma
vsse64.v v2, (a3), t5, v0.t
addi a3, a3, 64
bltu a7, a1, loc_142EE
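For context, the source loop being vectorized here is, reconstructed from memory of the 462.libquantum sources (the actual benchmark stores the state in an array of structs and the field/type names may differ), roughly:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t state_t;

/* Sketch of the quantum_toffoli inner loop: for each basis state, flip the
   target bit iff both control bits are set. The doubly guarded XOR is what
   the vectorizer if-converts into the masked vsse64.v store seen in the
   assembly above. */
static void toffoli_kernel(state_t *state, size_t n,
                           int control1, int control2, int target)
{
    for (size_t i = 0; i < n; i++) {
        if ((state[i] >> control1) & 1)
            if ((state[i] >> control2) & 1)
                state[i] ^= (state_t)1 << target;
    }
}
```

The chains of vnsrl.wx/vxor.vv/vand.vi in both listings correspond to evaluating the two bit tests, and the vsse64.v with v0.t is the if-converted strided store back into the struct array.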
vcompress is the main difference, and it is indeed a single-cycle instruction.
According to a colleague in the architecture department, the pmspc instructions
can only use VX3 when LMUL > 1, and can use VX3/VX1 when LMUL = 1. More
importantly, in that case the pmspc instructions are executed in non-pipelined
mode, which is very inefficient because the write-back (WB) has to wait for the
result of each MOP. (The segment load/store operations seem to be slow for
similar reasons.)
He suggested avoiding LMUL > 1 scenarios for the pmspc instructions (including
slide, gather, and compress) as much as possible.
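For readers unfamiliar with the instruction, the architectural semantics of vcompress.vm can be modeled in scalar C as follows (a sketch of the behavior only; it says nothing about its cost on any particular implementation):

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar model of RVV vcompress.vm: elements of src whose mask bit is set
   are packed contiguously at the front of dst, preserving their order.
   Returns the number of elements written. */
static size_t vcompress_model(uint64_t *dst, const uint64_t *src,
                              const uint8_t *mask, size_t vl)
{
    size_t out = 0;
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[out++] = src[i];
    return out;
}
```

In the noseg-zvl-m2 listing above, the vectorizer appears to use vcompress plus vslideup to de-interleave the struct fields, since -mno-autovec-segment rules out the segment load/store instructions.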
I think the main problem is that the compiler does not offer an option letting
us decide whether to generate the vcompress instruction or not, similar to
'-mno-autovec-segment'.