Re: LTO slows down calculix by more than 10% on aarch64

Prathamesh Kulkarni via Gcc Mon, 31 Aug 2020 04:22:39 -0700

On Fri, 28 Aug 2020 at 17:27, Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Fri, Aug 28, 2020 at 1:17 PM Prathamesh Kulkarni
> <prathamesh.kulka...@linaro.org> wrote:
> >
> > On Wed, 26 Aug 2020 at 16:50, Richard Biener <richard.guent...@gmail.com> wrote:
> > >
> > > On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc
> > > <gcc@gcc.gnu.org> wrote:
> > > >
> > > > Hi,
> > > > We're seeing a consistent regression of more than 10% on calculix with
> > > > -O2 -flto vs -O2 on aarch64 in our validation CI. I tried to investigate
> > > > this issue a bit, and it seems the regression comes from inlining of
> > > > orthonl into e_c3d. Disabling that brings back the performance. However,
> > > > inlining orthonl into e_c3d increases its size from 3187 to 3837, i.e.
> > > > by around 20.4%, which isn't too large.
> > > >
> > > > I have attached two test-cases: e_c3d.f, which has orthonl manually
> > > > inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> > > > which contains the unmodified function (gauss.f is included by
> > > > e_c3d.f). To reproduce, just passing -O2 is sufficient.
> > > >
> > > > It seems that inlining orthonl causes 20 hoistings into block 181,
> > > > which are then hoisted to block 173, in particular hoistings of
> > > > w(1, 1) ... w(3, 3), which weren't possible without inlining. The
> > > > hoistings happen because both the basic block that computes orthonl
> > > > (line 672) and the following block at line 1035 in e_c3d.f use
> > > > w(1, 1) ... w(3, 3):
> > > >
> > > > senergy=
> > > >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > >
> > > > Disabling hoisting into blocks 173 and 181 brings back most of the
> > > > performance. However, I am not able to understand why (or whether)
> > > > these hoistings of w(1, 1) ... w(3, 3) cause the slowdown. Looking at
> > > > the assembly, the hot code path reported by perf in e_c3d shows the
> > > > following code-gen diff.
> > > > For the inlined version:
> > > > .L122:
> > > >         ldr     d15, [x1, -248]
> > > >         add     w0, w0, 1
> > > >         add     x2, x2, 24
> > > >         add     x1, x1, 72
> > > >         fmul    d15, d17, d15
> > > >         fmul    d15, d15, d18
> > > >         fmul    d14, d15, d14
> > > >         fmadd   d16, d14, d31, d16
> > > >         cmp     w0, 4
> > > >         beq     .L121
> > > >         ldr     d14, [x2, -8]
> > > >         b       .L122
> > > >
> > > > and for non-inlined version:
> > > > .L118:
> > > >         ldr     d0, [x1, -248]
> > > >         add     w0, w0, 1
> > > >         ldr     d2, [x2, -8]
> > > >         add     x1, x1, 72
> > > >         add     x2, x2, 24
> > > >         fmul    d0, d3, d0
> > > >         fmul    d0, d0, d5
> > > >         fmul    d0, d0, d2
> > > >         fmadd   d1, d4, d0, d1
> > > >         cmp     w0, 4
> > > >         bne     .L118
> > >
> > > I wonder if you have profiles.  The inlined version has a
> > > non-empty latch block (looks like some PRE is happening
> > > there?).  Perhaps your uarch does not like the close
> > > branches (does your assembly show the layout as it is?).
> > Hi Richard,
> > I have uploaded profiles obtained by perf here:
> > -O2: https://people.linaro.org/~prathamesh.kulkarni/o2_perf.data
> > -O2 -flto: https://people.linaro.org/~prathamesh.kulkarni/o2_lto_perf.data
> >
> > For the above loop, it shows the following:
> > -O2:
> >   0.01 │ f1c:   ldur   d0, [x1, #-248]
> >   3.53 │        add    w0, w0, #0x1
> >        │        ldur   d2, [x2, #-8]
> >   3.54 │        add    x1, x1, #0x48
> >        │        add    x2, x2, #0x18
> >   5.89 │        fmul   d0, d3, d0
> >  14.12 │        fmul   d0, d0, d5
> >  14.14 │        fmul   d0, d0, d2
> >  14.13 │        fmadd  d1, d4, d0, d1
> >   0.00 │        cmp    w0, #0x4
> >   3.52 │      ↑ b.ne   f1c
> >
> > -O2 -flto:
> >   5.47 │ 1124:  ldur   d15, [x1, #-248]
> >   2.19 │        add    w0, w0, #0x1
> >   1.10 │        add    x2, x2, #0x18
> >   2.18 │        add    x1, x1, #0x48
> >   4.37 │        fmul   d15, d17, d15
> >  13.13 │        fmul   d15, d15, d18
> >  13.13 │        fmul   d14, d15, d14
> >  13.14 │        fmadd  d16, d14, d31, d16
> >        │        cmp    w0, #0x4
> >   3.28 │      ↓ b.eq   1154
> >   0.00 │        ldur   d14, [x2, #-8]
> >   2.19 │      ↑ b      1124
> >
> > IIUC, the biggest relative difference comes from the load [x1, #-248],
> > which in LTO's case takes 5.47% of overall samples:
> >   5.47 │ 1124:  ldur   d15, [x1, #-248]
> > while in the case of -O2, it's just 0.01%:
> >   0.01 │ f1c:   ldur   d0, [x1, #-248]
> >
> > I wonder if that's (one of) the main factor(s) behind the slowdown, or
> > whether it's not too relevant?
>
> This looks more like the branch, since branch costs are usually
> attributed to the target rather than the branch itself.  You could
> try re-ordering the code so the loop entry jumps around the
> latch, which can then fall through, to see if that makes a difference.
Thanks for the suggestions.
Is it possible to modify the assembly files emitted after the ltrans phase?
IIUC, the linker invokes lto1 twice, for wpa and ltrans, and then links
the resulting object files, which makes it impossible to hand-edit the
assembly files post ltrans?
In particular, I wanted to modify calculix.ltrans16.ltrans.s, which
contains e_c3d, to avoid the extra branch.
(If that doesn't work out, I can proceed with manually inlining in the
source and then modifying the generated assembly.)
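
A rough sketch of the reordering suggested above, written by hand against
the inlined loop quoted earlier in this thread. It has not been assembled
or measured. The label .Lentry is hypothetical, and the sketch assumes that
d14 is already loaded before the loop is entered and that nothing outside
the loop branches to .L122 other than the original loop entry:

```asm
        b       .Lentry             // loop entry jumps around the latch load
.L122:
        ldr     d14, [x2, -8]       // former latch block, now the back-edge target
.Lentry:                            // hypothetical label, not in the original output
        ldr     d15, [x1, -248]
        add     w0, w0, 1
        add     x2, x2, 24
        add     x1, x1, 72
        fmul    d15, d17, d15
        fmul    d15, d15, d18
        fmul    d14, d15, d14
        fmadd   d16, d14, d31, d16
        cmp     w0, 4
        bne     .L122               // single conditional back edge, latch falls through
        b       .L121               // loop exit; unneeded if .L121 is laid out right here
```

With this layout each iteration executes one taken branch instead of a
not-taken b.eq followed by an unconditional b, which is the difference
the experiment is meant to isolate.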


Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> > >
> > > > which corresponds to the following loop in line 1014.
> > > >                                 do n1=1,3
> > > >                                   s(iii1,jjj1)=s(iii1,jjj1)
> > > >      &                                  +anisox(m1,k1,n1,l1)
> > > >      &                                  *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
> > > >      &                                  *weight
> > > >
> > > > I am not sure why hoisting would have any direct effect on this loop,
> > > > except perhaps that hoisting allocated more registers and led to
> > > > increased register pressure. Perhaps that's why it's using
> > > > higher-numbered registers for code-gen in the inlined version?
> > > > However, disabling hoisting in blocks 173 and 181 also leads to 6
> > > > extra spills overall (counted by grepping for str to sp), so hoisting
> > > > is also helping here? I am not sure how to proceed further, and would
> > > > be grateful for suggestions.
> > > >
> > > > Thanks,
> > > > Prathamesh
