https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122152

--- Comment #4 from Lin Li <lilin at masscore dot cn> ---
Here is the performance data for 462.libquantum running on the SG2044 with
GCC 15.2.

                -mno-autovec-segment   zvl   -mrvv-max-lmul  SG2044 1-copy ratio
base                                                                58.6
noseg                    √                                          58.9
base-zvl                                √                           58.6
noseg-zvl                √              √                           49.8
noseg-m2                 √                        m2                61.6
noseg-zvl-m1             √              √         m1                50.0
noseg-zvl-m2             √              √         m2                 8.9
noseg-zvl-dyn            √              √       dynamic             22.5

> the whole compilation options are 
> '-march=rv64gcv_zba_zbb_zbc_zbs_zicond_zvl128b -mrvv-vector-bits=zvl 
> -mrvv-max-lmul=m2 -mno-autovec-segment'.

The noseg-zvl-m2 and noseg-zvl-dyn tests above are the scenarios I mentioned.
The following is the assembly code of quantum_toffoli for noseg-zvl-m1 and
noseg-zvl-m2.

loc_12A30(noseg-zvl-m1):
  minu            a5, a3, t4
  minu            t1, a5, t3
  sub             a5, a5, t1
  vsetvli         zero, a5, e64, ta, ma
  vle64.v         v2, (a7)
  vsetvli         zero, t1, e64, ta, ma
  vle64.v         v1, (a4)
  vsetivli        zero, 10h, e8, ta, ma
  minu            a5, a6, t3
  mv              t1, a3
  addi            a7, a7, 20h # ' '
  addi            a3, a3, -4
  vmv1r.v         v7, v1
  addi            a6, a6, -2
  vslideup.vi     v7, v2, 8
  vsetivli        zero, 2, e32, mf2, ta, ma
  vnsrl.wx        v5, v7, a0
  vnsrl.wx        v4, v7, t5
  vnsrl.wx        v2, v7, s0
  vnsrl.wx        v3, v7, a2
  vxor.vv         v4, v4, v5
  vnsrl.wx        v0, v7, t0
  vxor.vv         v3, v3, v2
  vnsrl.wx        v2, v7, t2
  vxor.vv         v0, v0, v4
  vsetvli         zero, zero, e64, ta, ma
  vxor.vv         v1, v6, v7
  vsetvli         zero, zero, e32, mf2, ta, ma
  vxor.vv         v2, v2, v3
  vand.vv         v0, v0, v2
  vand.vi         v0, v0, 1
  vmsne.vi        v0, v0, 0
  vsetvli         zero, a5, e64, ta, ma
  vsse64.v        v1, (a4), a1, v0.t
  addi            a4, a4, 20h # ' '
  bltu            t4, t1, loc_12A30

loc_142EE(noseg-zvl-m2):
  slli            a1, a4, 1
  minu            a5, a1, a7
  minu            s8, a5, a0
  sub             a5, a5, s8
  vsetvli         zero, a5, e64, m2, ta, ma
  vle64.v         v2, (a2)
  vsetvli         zero, s8, e64, m2, ta, ma
  vle64.v         v8, (a3)
  vmv1r.v         v0, v6
  vsetivli        zero, 4, e64, m2, ta, ma
  minu            a5, a4, a0
  vcompress.vm    v4, v2, v0
  addi            a2, a2, 40h # '@'
  vcompress.vm    v2, v8, v0
  addi            a4, a4, -4
  vslideup.vi     v2, v4, 2
  vsetvli         zero, zero, e32, ta, ma
  vnsrl.wx        v7, v2, s2
  vnsrl.wx        v5, v2, s1
  vnsrl.wx        v1, v2, t2
  vnsrl.wx        v4, v2, t0
  vnsrl.wx        v0, v2, s0
  vxor.vv         v5, v5, v7
  vxor.vv         v4, v4, v1
  vnsrl.wx        v1, v2, t6
  vxor.vv         v0, v0, v5
  vsetvli         zero, zero, e64, m2, ta, ma
  vxor.vv         v2, v10, v2
  vsetvli         zero, zero, e32, ta, ma
  vxor.vv         v1, v1, v4
  vand.vv         v0, v0, v1
  vand.vi         v0, v0, 1
  vmsne.vi        v0, v0, 0
  vsetvli         zero, a5, e64, m2, ta, ma
  vsse64.v        v2, (a3), t5, v0.t
  addi            a3, a3, 40h # '@'
  bltu            a7, a1, loc_142EE
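
For orientation, the loop shape visible in both listings (six shifts of one
64-bit field, two three-way XORs, an AND with 1, and a masked strided store)
can be mirrored in C roughly as follows. This is only a sketch reconstructed
from the assembly, not the benchmark source: the struct layout, the function
name and the parameter names are all assumptions, and the 16-byte node is
chosen only to be consistent with the pointer increments (20h per 2 elements,
40h per 4 elements).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte node: 8 bytes of payload followed by a 64-bit state.
   The layout is an assumption, not copied from the benchmark source. */
typedef struct {
    float    amplitude[2];
    uint64_t state;
} node_t;

/* Flip the bits in `mask` for every node whose two three-way parities
   (taken at bit positions c1[0..2] and c2[0..2]) are both 1.
   All shift amounts must be below 64. */
void flip_if_both_parities(node_t *node, size_t n,
                           const unsigned c1[3], const unsigned c2[3],
                           uint64_t mask)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = node[i].state;
        uint64_t p1 = (s >> c1[0]) ^ (s >> c1[1]) ^ (s >> c1[2]);
        uint64_t p2 = (s >> c2[0]) ^ (s >> c2[1]) ^ (s >> c2[2]);
        if (p1 & p2 & 1)              /* only bit 0 of each parity matters */
            node[i].state = s ^ mask; /* conditional store, stride 16 bytes */
    }
}
```

Because only the state field of each 16-byte node is read and written, a
vectorizer has to de-interleave the loaded data before the shifts, which is
where the slide and compress instructions in the listings come from.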


vcompress is the main difference and it's indeed a single-cycle instruction. 
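
For readers less familiar with vcompress, its architectural effect can be
modelled in scalar C as below. This is a minimal sketch only: the function
name and the one-byte-per-element mask representation are assumptions, and
elements past the packed result are simply left untouched here.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of vcompress.vm: copy src[i] for every set mask[i] into
   consecutive slots of dst, starting at dst[0]. Returns the number of
   elements written; the remaining slots of dst are left untouched. */
size_t compress_u64(uint64_t *dst, const uint64_t *src,
                    const uint8_t *mask, size_t vl)
{
    size_t out = 0;
    for (size_t i = 0; i < vl; i++) {
        if (mask[i])
            dst[out++] = src[i];
    }
    return out;
}
```

With an alternating mask {1,0,1,0} over four elements (an assumption, since
the listing does not show the contents of v6), the model keeps elements 0 and
2, which is the kind of de-interleaving that the two vcompress instructions
followed by vslideup.vi appear to perform in the m2 loop.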

According to a colleague in the architecture department, the pmspc
instructions can use VX3 or VX1 when LMUL = 1, but only VX3 when LMUL > 1. More
importantly, with LMUL > 1 the pmspc instructions are executed in non-pipelined
mode, which is very inefficient because the WB has to wait for the result of
each MOP. (Segment load/store operations seem to be slow for similar reasons.)

He suggested that the pmspc instructions (including slide, gather, and
compress) should be kept out of LMUL > 1 code as much as possible.

I think the main problem is that the compiler offers no option, comparable to
'-mno-autovec-segment', that lets us decide whether the vcompress instruction
is generated.
