On Mon, 31 Aug 2020 at 16:53, Prathamesh Kulkarni
<[email protected]> wrote:
>
> On Fri, 28 Aug 2020 at 17:33, Alexander Monakov <[email protected]> wrote:
> >
> > On Fri, 28 Aug 2020, Prathamesh Kulkarni via Gcc wrote:
> >
> > > I wonder if that's (one of) the main factor(s) behind the slowdown, or
> > > whether it's not too relevant?
> >
> > Probably not. Some advice to make your search more directed:
> >
> > Pass '-n' to 'perf report'. Relative sample ratios are hard to reason about
> > when they are computed against different bases, it's much easier to see that
> > a loop is slowing down if it went from 4000 to 4500 in absolute sample count
> > as opposed to 90% to 91% in relative sample ratio.
> >
> > Before diving down 'perf report', be sure to fully account for differences
> > in 'perf stat' output. Do the programs execute the same number of
> > instructions, so that the difference is only in scheduling? Do the
> > programs suffer from the same
> > amount of branch mispredictions? Please show output of 'perf stat' on the
> > mailing list too, so everyone is on the same page about that.
> >
> > I also suspect that the dramatic slowdown has to do with the extra branch.
> > Your CPU might have some specialized counters for branch prediction, see
> > 'perf list'.
> Hi Alexander,
> Thanks for the suggestions! I am in the process of doing the
> benchmarking experiments,
> and will post the results soon.
Hi,
I obtained 'perf stat' results for the following benchmark runs:
-O2:
7856832.692380 task-clock (msec) # 1.000 CPUs utilized
3758 context-switches # 0.000 K/sec
40 cpu-migrations # 0.000 K/sec
40847 page-faults # 0.005 K/sec
7856782413676 cycles # 1.000 GHz
6034510093417 instructions # 0.77 insn per cycle
363937274287 branches # 46.321 M/sec
48557110132 branch-misses # 13.34% of all branches
-O2 with orthonl inlined:
8319643.114380 task-clock (msec) # 1.000 CPUs utilized
4285 context-switches # 0.001 K/sec
28 cpu-migrations # 0.000 K/sec
40843 page-faults # 0.005 K/sec
8319591038295 cycles # 1.000 GHz
6276338800377 instructions # 0.75 insn per cycle
467400726106 branches # 56.180 M/sec
45986364011 branch-misses # 9.84% of all branches
-O2 with orthonl inlined and PRE disabled (this removes the extra branches):
8207331.088040 task-clock (msec) # 1.000 CPUs utilized
2266 context-switches # 0.000 K/sec
32 cpu-migrations # 0.000 K/sec
40846 page-faults # 0.005 K/sec
8207292032467 cycles # 1.000 GHz
6035724436440 instructions # 0.74 insn per cycle
364415440156 branches # 44.401 M/sec
53138327276 branch-misses # 14.58% of all branches
-O2 with orthonl inlined and hoisting disabled:
7797265.206850 task-clock (msec) # 1.000 CPUs utilized
3139 context-switches # 0.000 K/sec
20 cpu-migrations # 0.000 K/sec
40846 page-faults # 0.005 K/sec
7797221351467 cycles # 1.000 GHz
6187348757324 instructions # 0.79 insn per cycle
461840800061 branches # 59.231 M/sec
26920311761 branch-misses # 5.83% of all branches
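Following Alexander's suggestion to compare runs against a common base, it may help to normalize the branch-miss counts above by instruction count (mispredicts per thousand instructions). This is just arithmetic on the numbers quoted above, not a new measurement:

```shell
# Branch mispredicts per 1000 instructions (MPKI) for each run,
# computed from the branch-miss and instruction counts listed above.
awk 'BEGIN {
  printf "O2:           %.2f\n", 48557110132 / 6034510093417 * 1000
  printf "inlined:      %.2f\n", 45986364011 / 6276338800377 * 1000
  printf "no-PRE:       %.2f\n", 53138327276 / 6035724436440 * 1000
  printf "no-hoisting:  %.2f\n", 26920311761 / 6187348757324 * 1000
}'
```

On this normalized view, the hoisting-disabled run stands out with roughly half the mispredict rate of the others, which lines up with the suspicion that the extra branches are the main factor.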
Perf profiles for
-O2 -fno-code-hoisting and inlined orthonl:
https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
     3196866│1f04:   ldur  d1, [x1, #-248]
216348301800│        add   w0, w0, #0x1
      985098│        add   x2, x2, #0x18
216215999206│        add   x1, x1, #0x48
215630376504│        fmul  d1, d5, d1
863829148015│        fmul  d1, d1, d6
864228353526│        fmul  d0, d1, d0
864568163014│        fmadd d2, d0, d16, d2
            │        cmp   w0, #0x4
216125427594│      ↓ b.eq  1f34
    15010377│        ldur  d0, [x2, #-8]
143753737468│      ↑ b     1f04
-O2 with inlined orthonl:
https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
359871503840│1ef8:   ldur  d15, [x1, #-248]
144055883055│        add   w0, w0, #0x1
 72262104254│        add   x2, x2, #0x18
143991169721│        add   x1, x1, #0x48
288648917780│        fmul  d15, d17, d15
864665644756│        fmul  d15, d15, d18
863868426387│        fmul  d14, d15, d14
865228159813│        fmadd d16, d14, d31, d16
      245967│        cmp   w0, #0x4
215396760545│      ↓ b.eq  1f28
   704732365│        ldur  d14, [x2, #-8]
143775979620│      ↑ b     1ef8
AFAIU,
(a) Disabling PRE removes the extra branch around the loop, but that
results in only a slight performance increase (around 1.3%).
(b) Disabling hoisting brings performance back to (slightly above)
-O2 without inlining orthonl. The generated code for the loop has a
similar layout to -O2 with inlined orthonl, but uses lower-numbered
regs. Again, not sure if it's relevant, but the load from [x1, #-248]
seems to take much less time with hoisting disabled. I checked whether
this could be an alignment issue, but that does not seem to be the
case: in both cases (with / without hoisting) the address pointed to
by x1 was properly aligned, and differed by only 32 bytes between the
two cases.
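For reference, the alignment check itself is a one-liner. The two addresses below are made-up placeholders standing in for the observed x1 values (not the actual ones from the runs), differing by 32 bytes as in the real case:

```shell
# Cache-line offset (address mod 64) of two load addresses that differ
# by 32 bytes. The addresses here are hypothetical placeholders, not
# the real x1 values.
for addr in 0x4a3000 0x4a3020; do
  printf '%s mod 64 = %d\n' "$addr" $(( addr % 64 ))
done
```

Both offsets stay within a single 64-byte line's worth of positions, so a 32-byte difference alone would not imply a misaligned access, though it could still change which line a load hits.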
Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Alexander