On Fri, 28 Aug 2020 at 17:27, Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Fri, Aug 28, 2020 at 1:17 PM Prathamesh Kulkarni
> <prathamesh.kulka...@linaro.org> wrote:
> >
> > On Wed, 26 Aug 2020 at 16:50, Richard Biener <richard.guent...@gmail.com>
> > wrote:
> > >
> > > On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc
> > > <gcc@gcc.gnu.org> wrote:
> > > >
> > > > Hi,
> > > > We're seeing a consistent regression of >10% on calculix with
> > > > -O2 -flto vs -O2 on aarch64 in our validation CI. I tried to
> > > > investigate this issue a bit, and it seems the regression comes from
> > > > inlining of orthonl into e_c3d. Disabling that brings back the
> > > > performance. However, inlining orthonl into e_c3d increases its size
> > > > from 3187 to 3837, i.e. by around 20.4% (the added 650 is 16.9% of
> > > > the inlined size), which isn't too large.
> > > >
> > > > I have attached two test-cases: e_c3d.f, which has orthonl manually
> > > > inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> > > > which contains the unmodified function.
> > > > (gauss.f is included by e_c3d.f). For reproducing, just passing -O2 is
> > > > sufficient.
> > > >
> > > > It seems that inlining orthonl causes 20 hoistings into block 181,
> > > > which are then hoisted to block 173, in particular hoistings of
> > > > w(1, 1) ... w(3, 3), which wasn't possible without inlining. The
> > > > hoistings happen because the basic block that computes orthonl in
> > > > line 672 has w(1, 1) ... w(3, 3), and so does the following block at
> > > > line 1035 in e_c3d.f:
> > > >
> > > >       senergy=
> > > >      &     (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > >      &     +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > >      &     +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > >
> > > > Disabling hoisting into blocks 173 (and 181) brings back most of the
> > > > performance. I am not able to understand why (if?) these hoistings of
> > > > w(1, 1) ... w(3, 3) are causing slowdown however.
> > > > Looking at assembly, the hot code-path from perf in e_c3d shows the
> > > > following code-gen diff:
> > > > For inlined version:
> > > > .L122:
> > > >         ldr     d15, [x1, -248]
> > > >         add     w0, w0, 1
> > > >         add     x2, x2, 24
> > > >         add     x1, x1, 72
> > > >         fmul    d15, d17, d15
> > > >         fmul    d15, d15, d18
> > > >         fmul    d14, d15, d14
> > > >         fmadd   d16, d14, d31, d16
> > > >         cmp     w0, 4
> > > >         beq     .L121
> > > >         ldr     d14, [x2, -8]
> > > >         b       .L122
> > > >
> > > > and for non-inlined version:
> > > > .L118:
> > > >         ldr     d0, [x1, -248]
> > > >         add     w0, w0, 1
> > > >         ldr     d2, [x2, -8]
> > > >         add     x1, x1, 72
> > > >         add     x2, x2, 24
> > > >         fmul    d0, d3, d0
> > > >         fmul    d0, d0, d5
> > > >         fmul    d0, d0, d2
> > > >         fmadd   d1, d4, d0, d1
> > > >         cmp     w0, 4
> > > >         bne     .L118
> > >
> > > I wonder if you have profiles. The inlined version has a
> > > non-empty latch block (looks like some PRE is happening
> > > there?). Possibly your uarch does not like the close
> > > branches (does your assembly show the layout as it is?).
> > Hi Richard,
> > I have uploaded profiles obtained by perf here:
> > -O2: https://people.linaro.org/~prathamesh.kulkarni/o2_perf.data
> > -O2 -flto: https://people.linaro.org/~prathamesh.kulkarni/o2_lto_perf.data
> >
> > For the above loop, it shows the following:
> > -O2:
> >   0.01 │ f1c:   ldur   d0, [x1, #-248]
> >   3.53 │        add    w0, w0, #0x1
> >        │        ldur   d2, [x2, #-8]
> >   3.54 │        add    x1, x1, #0x48
> >        │        add    x2, x2, #0x18
> >   5.89 │        fmul   d0, d3, d0
> >  14.12 │        fmul   d0, d0, d5
> >  14.14 │        fmul   d0, d0, d2
> >  14.13 │        fmadd  d1, d4, d0, d1
> >   0.00 │        cmp    w0, #0x4
> >   3.52 │      ↑ b.ne   f1c
> >
> > -O2 -flto:
> >   5.47 │1124:   ldur   d15, [x1, #-248]
> >   2.19 │        add    w0, w0, #0x1
> >   1.10 │        add    x2, x2, #0x18
> >   2.18 │        add    x1, x1, #0x48
> >   4.37 │        fmul   d15, d17, d15
> >  13.13 │        fmul   d15, d15, d18
> >  13.13 │        fmul   d14, d15, d14
> >  13.14 │        fmadd  d16, d14, d31, d16
> >        │        cmp    w0, #0x4
> >   3.28 │      ↓ b.eq   1154
> >   0.00 │        ldur   d14, [x2, #-8]
> >   2.19 │      ↑ b      1124
> >
> > IIUC, the biggest relative difference comes
> > from load [x1, #-248]
> > which in LTO's case takes 5.47% of overall samples:
> >   5.47 │1124:   ldur   d15, [x1, #-248]
> > while in case of -O2, it's just 0.01%:
> >   0.01 │ f1c:   ldur   d0, [x1, #-248]
> >
> > I wonder if that's (one of) the main factor(s) behind the slowdown or
> > it's not too relevant?
>
> This looks more like the branch since usually branch costs
> are attributed to the target rather than the branch itself. You could
> try re-ordering the code so the loop entry jumps around the
> latch which can then fall thru to see if that makes a difference.

Thanks for the suggestions. Is it possible to modify the assembly files
emitted after the ltrans phase? IIUC, the linker invokes lto1 twice, for
wpa and ltrans, and then links the resulting object files, which doesn't
make it possible to hand-edit the assembly files post ltrans?
In particular, I wanted to modify calculix.ltrans16.ltrans.s, which
contains e_c3d, to avoid the extra branch.
(If that doesn't work out, I can proceed with manually inlining in the
source and then modifying the generated assembly.)
Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> > >
> > > > which corresponds to the following loop in line 1014.
> > > >       do n1=1,3
> > > >         s(iii1,jjj1)=s(iii1,jjj1)
> > > >      &          +anisox(m1,k1,n1,l1)
> > > >      &          *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
> > > >      &          *weight
> > > >
> > > > I am not sure why hoisting would have any direct effect on this loop,
> > > > except perhaps that hoisting allocated more registers and led to
> > > > increased register pressure. Perhaps that's why it's using
> > > > higher-numbered regs for code-gen in the inlined version? However,
> > > > disabling hoisting in blocks 173 and 181 also leads to 6 extra spills
> > > > overall (found by grepping for str to sp), so hoisting is also
> > > > helping here? I am not sure how to proceed further, and would be
> > > > grateful for suggestions.
> > > >
> > > > Thanks,
> > > > Prathamesh
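
To make the difference in loop shape discussed above a bit more concrete, here is a rough sketch that counts dynamic instructions and branches for the two layouts. It is only a counting model, not a timing model: the instruction counts per block are read off the two listings (.L118 and .L122), the trip count of 3 is an assumption taken from `do n1=1,3`, and the names `LoopCost`, `single_block_loop` and `loop_with_latch` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoopCost:
    instructions: int    # dynamic instructions executed inside the loop
    branches: int        # branch instructions executed
    taken_branches: int  # branches that actually redirect control flow


def single_block_loop(trips: int, body_len: int = 11) -> LoopCost:
    """Shape of .L118: one block ending in a conditional back-branch (bne)."""
    # body_len counts the 11 instructions from ldr through bne.
    # The back-branch is taken on every trip except the last one.
    return LoopCost(trips * body_len, trips, trips - 1)


def loop_with_latch(trips: int, body_len: int = 10, latch_len: int = 2) -> LoopCost:
    """Shape of .L122: body ends in an exit test (beq), followed by a latch."""
    # The latch (ldr + b) runs on every trip except the last one.
    latch_runs = trips - 1
    instructions = trips * body_len + latch_runs * latch_len
    # One beq per trip, plus one unconditional b per latch execution.
    branches = trips + latch_runs
    # beq is taken once (on exit); the b in the latch is always taken.
    taken = 1 + latch_runs
    return LoopCost(instructions, branches, taken)


if __name__ == "__main__":
    print("non-inlined:", single_block_loop(3))
    print("inlined:    ", loop_with_latch(3))
```

Under these assumptions the instruction counts are nearly the same (33 vs 34 for three trips), but the inlined layout executes 5 branches (3 taken) where the non-inlined one executes 3 (2 taken). The model ignores whatever loads d14 before the first trip, since the listing does not show that code.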