On Mon, 31 Aug 2020 at 16:53, Prathamesh Kulkarni
<prathamesh.kulka...@linaro.org> wrote:
>
> On Fri, 28 Aug 2020 at 17:33, Alexander Monakov <amona...@ispras.ru> wrote:
> >
> > On Fri, 28 Aug 2020, Prathamesh Kulkarni via Gcc wrote:
> >
> > > I wonder if that's (one of) the main factor(s) behind the slowdown,
> > > or if it's not too relevant?
> >
> > Probably not. Some advice to make your search more directed:
> >
> > Pass '-n' to 'perf report'. Relative sample ratios are hard to reason
> > about when they are computed against different bases; it's much easier
> > to see that a loop is slowing down if it went from 4000 to 4500 in
> > absolute sample count, as opposed to 90% to 91% in relative sample
> > ratio.
> >
> > Before diving into 'perf report', be sure to fully account for the
> > differences in 'perf stat' output. Do the programs execute the same
> > number of instructions, so that the difference is only in scheduling?
> > Do the programs suffer from the same amount of branch mispredictions?
> > Please show the output of 'perf stat' on the mailing list too, so
> > everyone is on the same page about that.
> >
> > I also suspect that the dramatic slowdown has to do with the extra
> > branch. Your CPU might have some specialized counters for branch
> > prediction, see 'perf list'.
> Hi Alexander,
> Thanks for the suggestions! I am in the process of doing the
> benchmarking experiments, and will post the results soon.

Hi,
I obtained perf stat results for the following benchmark runs.
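(For anyone reproducing this: the counters below are what a bare
'perf stat' prints by default, and the annotated loops further down are
the kind of output 'perf record' followed by 'perf report -n' gives, as
Alexander suggested above. A sketch of the invocations, with
'./benchmark' standing in for the actual run command:

  perf stat ./benchmark          # task-clock, cycles, instructions,
                                 # branches, branch-misses, ...
  perf record -- ./benchmark     # collect samples into perf.data
  perf report -n                 # '-n' adds an absolute sample-count column
  perf list | grep -i branch     # look for branch-prediction related events
)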
-O2:

    7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
              3758      context-switches          #    0.000 K/sec
                40      cpu-migrations            #    0.000 K/sec
             40847      page-faults               #    0.005 K/sec
     7856782413676      cycles                    #    1.000 GHz
     6034510093417      instructions              #    0.77  insn per cycle
      363937274287      branches                  #   46.321 M/sec
       48557110132      branch-misses             #   13.34% of all branches

-O2 with orthonl inlined:

    8319643.114380      task-clock (msec)         #    1.000 CPUs utilized
              4285      context-switches          #    0.001 K/sec
                28      cpu-migrations            #    0.000 K/sec
             40843      page-faults               #    0.005 K/sec
     8319591038295      cycles                    #    1.000 GHz
     6276338800377      instructions              #    0.75  insn per cycle
      467400726106      branches                  #   56.180 M/sec
       45986364011      branch-misses             #    9.84% of all branches

-O2 with orthonl inlined and PRE disabled (this removes the extra branches):

    8207331.088040      task-clock (msec)         #    1.000 CPUs utilized
              2266      context-switches          #    0.000 K/sec
                32      cpu-migrations            #    0.000 K/sec
             40846      page-faults               #    0.005 K/sec
     8207292032467      cycles                    #    1.000 GHz
     6035724436440      instructions              #    0.74  insn per cycle
      364415440156      branches                  #   44.401 M/sec
       53138327276      branch-misses             #   14.58% of all branches

-O2 with orthonl inlined and hoisting disabled:

    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
              3139      context-switches          #    0.000 K/sec
                20      cpu-migrations            #    0.000 K/sec
             40846      page-faults               #    0.005 K/sec
     7797221351467      cycles                    #    1.000 GHz
     6187348757324      instructions              #    0.79  insn per cycle
      461840800061      branches                  #   59.231 M/sec
       26920311761      branch-misses             #    5.83% of all branches

Perf profile for -O2 -fno-code-hoisting and inlined orthonl:
https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data

     3196866 │1f04:   ldur  d1, [x1, #-248]
216348301800 │        add   w0, w0, #0x1
      985098 │        add   x2, x2, #0x18
216215999206 │        add   x1, x1, #0x48
215630376504 │        fmul  d1, d5, d1
863829148015 │        fmul  d1, d1, d6
864228353526 │        fmul  d0, d1, d0
864568163014 │        fmadd d2, d0, d16, d2
             │        cmp   w0, #0x4
216125427594 │      ↓ b.eq  1f34
    15010377 │        ldur  d0, [x2, #-8]
143753737468 │      ↑ b     1f04

-O2 with inlined orthonl:
https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data

359871503840 │1ef8:   ldur  d15, [x1, #-248]
144055883055 │        add   w0, w0, #0x1
 72262104254 │        add   x2, x2, #0x18
143991169721 │        add   x1, x1, #0x48
288648917780 │        fmul  d15, d17, d15
864665644756 │        fmul  d15, d15, d18
863868426387 │        fmul  d14, d15, d14
865228159813 │        fmadd d16, d14, d31, d16
      245967 │        cmp   w0, #0x4
215396760545 │      ↓ b.eq  1f28
   704732365 │        ldur  d14, [x2, #-8]
143775979620 │      ↑ b     1ef8

AFAIU,
(a) Disabling PRE removes the extra branch around the loop, but that
gives only a slight performance improvement (around 1.3%).
(b) Disabling hoisting brings performance back to (slightly better
than) -O2 without inlining orthonl. The generated code for the loop has
a similar layout to -O2 with inlined orthonl, but uses low-numbered
registers.

Again, not sure if it's relevant, but the load from [x1, #-248] seems
to take much less time with hoisting disabled. I checked whether this
might be an alignment issue, but that does not seem to be the case: in
both cases (with / without hoisting) the address in x1 was properly
aligned, and the two addresses differed by only 32 bytes.

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Alexander
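P.S. For readers who haven't looked at the dumps, a minimal,
hypothetical C sketch (not the actual orthonl code; names are made up)
of the two transformations being discussed: PRE can leave an extra
guard branch around a loop when it moves a partially redundant
expression out of it, and code hoisting pulls a computation that
appears in both arms of a branch up above the branch, lengthening the
live range of its result.

/* 1. PRE: 'a + b' would be recomputed on every iteration, i.e. it is
   partially redundant along the loop back edge.  PRE evaluates it once
   before the loop; since the loop may run zero times, a guard is
   needed: this is the "extra branch" around the loop.  */
double
pre_sketch (double a, double b, const double *p, int n)
{
  double s = 0.0;
  if (n > 0)                /* guard branch inserted around the loop */
    {
      double t = a + b;     /* computed once instead of per iteration */
      for (int i = 0; i < n; i++)
        s += t * p[i];
    }
  return s;
}

/* 2. Code hoisting: 'a * b' occurs in both arms of the conditional, so
   it is hoisted above the branch.  That saves a duplicate computation,
   but the result now stays live across the branch, which can raise
   register pressure in a hot loop.  */
double
hoist_sketch (double a, double b, double p, double q, int c)
{
  double t = a * b;         /* hoisted ahead of the branch */
  return c ? t + p : t + q;
}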