https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117733

            Bug ID: 117733
           Summary: RISC-V SPEC2017 503.bwaves Inefficient fortran
                    multi-dimensional array access
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vineetg at gcc dot gnu.org
                CC: jeffreyalaw at gmail dot com, rdapp at gcc dot gnu.org
  Target Milestone: ---

bwaves and cam4 contain many Fortran multi-dimensional array accesses traversed
by nested loops, and the vector codegen for them does not look clean or
efficient.

I have a reduced test:

      subroutine shell(q,nx)

      implicit none

      integer nx
      real(kind=8) q(5,nx)
      real(kind=8) dqnorm
      integer l,i

      dqnorm = 0.0d0

      do i=1,nx
         do l=1,5
            dqnorm = dqnorm + q(l,i)*q(l,i)
         enddo
      enddo

      call use_val(dqnorm)
      return
      end
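To build the reduced test standalone, a stub and driver along these lines
should do (hypothetical, not part of the original reduction; use_val merely
consumes the result so the reduction is not dead-code-eliminated):

      subroutine use_val(v)
      implicit none
      real(kind=8) v
      print *, v
      end

      program driver
      implicit none
      integer nx
      parameter (nx=1024)
      real(kind=8) q(5,nx)
      q = 1.0d0
      call shell(q,nx)
      end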

Compiled with:

-Ofast -ftree-vectorize -march=rv64gcv_zvl256b_zba_zbb_zbs
-mrvv-vector-bits=zvl -mabi=lp64d

The relevant output is not efficient:

        vsetivli        zero,4,e64,m1,ta,ma
        vmv.v.i v2,0
        sh2add  a4,a4,a4
        addi    t5,a0,32
        vmv1r.v v3,v2
        vmv1r.v v4,v2
        vmv1r.v v1,v2
        vmv1r.v v5,v2
        addi    t4,a0,64
        addi    t3,a0,96
        addi    t1,a0,128
        li      t6,20
        li      a3,4
.L3:
        minu    a5,a4,t6
        minu    a7,a5,a3
        sub     a5,a5,a7
        minu    a6,a5,a3
        sub     a5,a5,a6
        minu    a1,a5,a3
        sub     a5,a5,a1
        vsetvli zero,a7,e64,m1,ta,ma
        vle64.v v10,0(a0)
        minu    a2,a5,a3
        vsetvli zero,a6,e64,m1,ta,ma
        vle64.v v9,0(t5)
        sub     a5,a5,a2
        vsetvli zero,a1,e64,m1,ta,ma
        vle64.v v8,0(t4)
        vsetvli zero,a5,e64,m1,ta,ma
        vle64.v v7,0(t1)
        vsetvli zero,a2,e64,m1,ta,ma
        vle64.v v6,0(t3)
        vsetvli zero,a7,e64,m1,tu,ma
        vfmacc.vv       v5,v10,v10
        vsetvli zero,a6,e64,m1,tu,ma
        vfmacc.vv       v1,v9,v9
        vsetvli zero,a1,e64,m1,tu,ma
        vfmacc.vv       v4,v8,v8
        vsetvli zero,a5,e64,m1,tu,ma
        vfmacc.vv       v2,v7,v7
        mv      t0,a4
        vsetvli zero,a2,e64,m1,tu,ma
        vfmacc.vv       v3,v6,v6
        addi    a0,a0,160
        addi    t5,t5,160
        addi    t4,t4,160
        addi    t1,t1,160
        addi    t3,t3,160
        addi    a4,a4,-20
        bgtu    t0,t6,.L3
...

(1) There is a separate vle64 per unrolled element fetch per loop iteration,
even though Fortran is column-major, so the elements accessed in the inner
loop are consecutive in memory and a single contiguous load could cover them
(see the sketch below).
(2) VL is used for predication, and at runtime some of the vsetvli results
will be VL=0 (for the trailing sub-vectors), which can be costly on some
uarches.
(3) All the loads are emitted first, followed by all the multiply-accumulates,
instead of batching operations that share the same VL, which forces the
repeated vsetvli toggling.
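For comparison, since the whole reduction runs over 5*nx consecutive doubles,
a single strip-mined loop with one vsetvli, one contiguous load and one
vfmacc per iteration would suffice. A hand-written sketch (not compiler
output; the register roles a0 = &q(1,1) and a1 = 5*nx remaining elements are
assumptions):

        vsetvli t0,zero,e64,m1,ta,ma   # vl = VLMAX
        vmv.v.i v1,0                   # zero the whole accumulator
.Lloop:
        vsetvli t0,a1,e64,m1,tu,ma     # vl = min(remaining, VLMAX)
        vle64.v v2,0(a0)               # one contiguous load
        vfmacc.vv       v1,v2,v2       # v1 += v2*v2, tail undisturbed
        sh3add  a0,t0,a0               # q pointer += vl*8 bytes
        sub     a1,a1,t0
        bnez    a1,.Lloop
        vsetvli t0,zero,e64,m1,ta,ma
        vmv.s.x v0,zero                # 0.0 seed for the reduction
        vfredusum.vs    v0,v1,v0      # unordered sum, fine under -Ofast
        vfmv.f.s        fa0,v0        # dqnorm

The tu policy keeps the partial sums in the tail lanes across the shrinking
final iteration, and the unordered vfredusum is valid here because -Ofast
already permits reassociation.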
