https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120687

            Bug ID: 120687
           Summary: RISC-V: very poor vector code gen for LMbench bw_mem
                    test case
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bergner at gcc dot gnu.org
  Target Milestone: ---

We've seen some VERY poor vector code gen for the bw_mem test case in LMbench
for certain numbers of loads in the loop.  I have extracted a simple test case
that shows the issue.  For smallish numbers of loads, I get what I would
generally expect:

linux%~$ cat bw_mem_8.c
int
frd (int *p, int *lastone)
{
  int sum = 0;
  for (; p <= lastone; p += 8)
    sum += p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7];
  return sum;
}
linux%~$ riscv64-unknown-linux-gnu-gcc -S -O3 -march=rv64gcv bw_mem_8.c
linux%~$ cat bw_mem_8.s
[snip] looking at just main loop body...
.L3:
        vsetvli a5,a4,e32,m1,tu,ma
        vlseg8e32.v     v8,(a0)
        slli    a3,a5,5
        sub     a4,a4,a5
        add     a0,a0,a3
        vadd.vv v1,v9,v8
        vadd.vv v1,v1,v10
        vadd.vv v1,v1,v11
        vadd.vv v1,v1,v12
        vadd.vv v1,v1,v13
        vadd.vv v1,v1,v14
        vadd.vv v1,v1,v15
        vadd.vv v2,v1,v2
        bne     a4,zero,.L3
[snip]

If I double the number of loads and update the loop increment to match, I see:

linux%~$ cat bw_mem_16.c
int
frd (int *p, int *lastone)
{
  int sum = 0;
  for (; p <= lastone; p += 16)
    sum += p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7]
           + p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15];
  return sum;
}
linux%~$ riscv64-unknown-linux-gnu-gcc -S -O3 -march=rv64gcv bw_mem_16.c
linux%~$ cat bw_mem_16.s
[snip] looking at just main loop body...
.L4:
        add     s4,a4,a3
        addi    s2,s4,16
        vle32.v v11,0(s2)
        vmv1r.v v0,v2
        vle32.v v13,0(s4)
        addi    s1,s4,32
        vle32.v v19,0(s1)
        addi    s0,s4,48
        vcompress.vm    v21,v11,v0
        vmv1r.v v0,v1
        vle32.v v10,0(s0)
        addi    s5,s4,64
        vcompress.vm    v20,v11,v0
        vmv1r.v v0,v2
        vle32.v v18,0(s5)
        addi    t2,s4,80
        vcompress.vm    v11,v13,v0
        vmv1r.v v0,v1
        vle32.v v9,0(t2)
        vslideup.vi     v11,v21,2
        vcompress.vm    v15,v13,v0
        vmv1r.v v0,v2
        addi    t0,s4,96
        vslideup.vi     v15,v20,2
        vcompress.vm    v13,v19,v0
        vmv1r.v v0,v1
        vle32.v v14,0(t0)
        addi    t6,s4,112
        vcompress.vm    v21,v19,v0
        vmv1r.v v0,v2
        vle32.v v8,0(t6)
        addi    t5,s4,128
        vcompress.vm    v20,v10,v0
        vmv1r.v v0,v1
        vle32.v v17,0(t5)
[snip] ...this goes on for many pages!
        vcompress.vm    v8,v7,v0
        vslideup.vi     v10,v9,2
        vcompress.vm    v7,v6,v0
        vadd.vv v3,v3,v12
        vcompress.vm    v6,v5,v0
        vslideup.vi     v8,v7,2
        vadd.vv v3,v3,v10
        vcompress.vm    v7,v4,v0
        vcompress.vm    v5,v24,v0
        vadd.vv v3,v3,v8
        vslideup.vi     v6,v7,2
        vcompress.vm    v7,v22,v0
        vcompress.vm    v4,v20,v0
        vadd.vv v3,v3,v6
        vslideup.vi     v5,v7,2
        vcompress.vm    v6,v18,v0
        vadd.vv v3,v3,v5
        vslideup.vi     v4,v6,2
        vadd.vv v3,v3,v4
        vle32.v v4,0(sp)
        vadd.vv v3,v4,v3
        vse32.v v3,0(sp)
        bne     a3,s3,.L4
[snip] end of main loop.

Counting the insns in the loop, I'm seeing over 20 times as many instructions
in this loop as in the loop for the 8 element test case!

The original bw_mem test case in LMbench does 128 loads within the loop, which
only makes the issue worse.
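
For anyone who wants to try the 128-load shape without pulling in LMbench, here
is a minimal sketch that scales the reduced test case above to 128 loads per
iteration.  This is NOT the LMbench source: the file name bw_mem_128.c and the
X1..X128 helper macros are hypothetical, made up only to avoid typing out 128
terms by hand.  It assumes the LMbench loop has the same flat p[0] + p[1] + ...
form as the 8 and 16 element reductions.

```c
/* Hypothetical bw_mem_128.c: the reduced test case scaled to 128 loads.
   Each Xn(i) expands to the flat chain "+ p[i] + p[i+1] + ... + p[i+n-1]",
   so the sum has the same left-to-right form as the hand-written versions.  */
#define X1(i)   + p[(i)]
#define X2(i)   X1 (i) X1 ((i) + 1)
#define X4(i)   X2 (i) X2 ((i) + 2)
#define X8(i)   X4 (i) X4 ((i) + 4)
#define X16(i)  X8 (i) X8 ((i) + 8)
#define X32(i)  X16 (i) X16 ((i) + 16)
#define X64(i)  X32 (i) X32 ((i) + 32)
#define X128(i) X64 (i) X64 ((i) + 64)

int
frd (int *p, int *lastone)
{
  int sum = 0;
  /* lastone must point at the start of the last full 128-element block.  */
  for (; p <= lastone; p += 128)
    sum += X128 (0);  /* The leading '+' of the expansion is a unary plus.  */
  return sum;
}
```

It can be compiled with the same options as the test cases above
(-S -O3 -march=rv64gcv) to inspect the loop body.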


I'm marking this as a target bug for now until we know more...
