On Mon, Sep 17, 2018 at 10:50:58AM +0200, Ilya Leoshkevich wrote: > > Am 14.09.2018 um 23:35 schrieb Segher Boessenkool > > <seg...@kernel.crashing.org>: > > Could you please show generated code before and after this patch? > > I mean generated assembler code. What -S gives you. > > Before: > > foo4: > .LFB0: > .cfi_startproc > lt %r1,0(%r2) > jne .L2 > lhi %r3,1 > cs %r1,%r3,0(%r2) > .L2: > jne .L5 > br %r14 > .L5: > jg bar > > After: > > foo4: > .LFB0: > .cfi_startproc > lt %r1,0(%r2) > jne .L4 > lhi %r3,1 > cs %r1,%r3,0(%r2) > jne .L4 > br %r14 > .L4: > jg bar
Ah. And a compiler of some weeks old gives foo4: .LFB0: .cfi_startproc lhi %r3,0 lhi %r4,1 cs %r3,%r4,0(%r2) jne .L4 br %r14 .L4: jg bar so this is all caused by the recent optimisation that avoids the "cs" if it can. > > Why does the existing jump threading not work for you; should it happen > > at another time? > > We call cleanup_cfg (CLEANUP_THREADING) only once - during the „jump“ > pass, which happens before combine. There is also „jump2“ pass, which > happens afterwards, and after discussion with Ulrich Weigand I tried to > move jump threading there. While this change had the desired effect on > the testcase, the code got worse in another places. Example from > GemsFDTD: > > Before: > > 103e96c: b9 04 00 31 lgr %r3,%r1 > 103e970: a7 18 00 00 lhi %r1,0 > 103e974: e3 20 f0 d0 00 0d dsg %r2,208(%r15) > 103e97a: b9 20 00 c3 cgr %r12,%r3 > 103e97e: a7 29 ff ff lghi %r2,-1 > 103e982: ec 12 00 01 00 42 lochih %r1,1 > 103e988: e3 20 f0 f8 00 82 xg %r2,248(%r15) > 103e98e: 1a 15 ar %r1,%r5 > 103e990: b9 e9 c0 72 sgrk %r7,%r2,%r12 > 103e994: b3 c1 00 e2 ldgr %f14,%r2 > 103e998: b9 f8 a0 21 ark %r2,%r1,%r10 > 103e99c: 12 99 ltr %r9,%r9 > 103e99e: a7 18 00 00 lhi %r1,0 > 103e9a2: ec 1c 00 01 00 42 lochile %r1,1 > 103e9a8: 50 10 f1 28 st %r1,296(%r15) > 103e9ac: c2 9d 00 00 00 00 cfi %r9,0 > 103e9b2: c0 c4 00 01 28 4e jgle 1063a4e > 103e9b8: e5 4c f1 08 00 00 mvhi 264(%r15),0 > 103e9be: eb 14 00 03 00 0d sllg %r1,%r4,3 > > After: > > 103e96c: b9 04 00 31 lgr %r3,%r1 > 103e970: e5 4c f1 18 00 01 mvhi 280(%r15),1 > 103e976: a7 18 00 00 lhi %r1,0 > 103e97a: e3 20 f0 d0 00 0d dsg %r2,208(%r15) > 103e980: b9 20 00 c3 cgr %r12,%r3 > 103e984: a7 29 ff ff lghi %r2,-1 > 103e988: ec 12 00 01 00 42 lochih %r1,1 > 103e98e: e3 20 f0 f8 00 82 xg %r2,248(%r15) > 103e994: 1a 15 ar %r1,%r5 > 103e996: b3 c1 00 e2 ldgr %f14,%r2 > 103e99a: b9 e9 c0 72 sgrk %r7,%r2,%r12 > 103e99e: b9 f8 a0 21 ark %r2,%r1,%r10 > 103e9a2: c2 9d 00 00 00 00 cfi %r9,0 > 103e9a8: c0 c4 00 01 28 51 jgle 1063a4a > 103e9ae: e5 4c f1 18 00 00 mvhi 280(%r15),0 > 103e9b4: e5 4c f1 08 00 00 mvhi 264(%r15),0 > 103e9ba: eb 14 00 03 00 0d sllg %r1,%r4,3 > > Diff: > > lgr %rn,%rn > - mvhi m(%rx),1 > lhi %rn,0 > > ar %rn,%rn > - ldgr %fn,%rn > sgrk %rn,%rn,%rn > + ldgr %fn,%rn > ark %rn,%rn,%rn > + ltr %rn,%rn > + lhi %rn,0 > + lochile %rn,1 > + st %rn,m(%rx) > cfi %rn,0 > > mvhi m(%rx),0 > - mvhi m(%rx),0 > sllg %rn,%rn,3 > > The code used to compare %r9 with 0, jump based on the comparison > result, and set the local variable in the branch. Now it compares %r9 > with 0 twice - I didn’t look at the details yet, but apparently we’re > missing some optimization because jump threading is done too late. Yeah, jump2 is quite late. I think you should do it before postreload_cse or similar. Btw, what percentage of files (or subroutines) does jump threading run after combine, with your patch? Segher