On Mon, 31 Aug 2020 at 16:53, Prathamesh Kulkarni
<prathamesh.kulka...@linaro.org> wrote:
>
> On Fri, 28 Aug 2020 at 17:33, Alexander Monakov <amona...@ispras.ru> wrote:
> >
> > On Fri, 28 Aug 2020, Prathamesh Kulkarni via Gcc wrote:
> >
> > > I wonder if that's (one of) the main factor(s) behind the slowdown,
> > > or if it's not too relevant?
> >
> > Probably not. Some advice to make your search more directed:
> >
> > Pass '-n' to 'perf report'. Relative sample ratios are hard to reason
> > about when they are computed against different bases; it's much easier
> > to see that a loop is slowing down if it went from 4000 to 4500 in
> > absolute sample count, as opposed to 90% to 91% in relative sample
> > ratio.
> >
> > Before diving into 'perf report', be sure to fully account for the
> > differences in 'perf stat' output. Do the programs execute the same
> > number of instructions, so that the difference is only in scheduling?
> > Do the programs suffer from the same amount of branch mispredictions?
> > Please show the output of 'perf stat' on the mailing list too, so
> > everyone is on the same page about that.
> >
> > I also suspect that the dramatic slowdown has to do with the extra
> > branch. Your CPU might have some specialized counters for branch
> > prediction, see 'perf list'.
> Hi Alexander,
> Thanks for the suggestions! I am in the process of doing the
> benchmarking experiments, and will post the results soon.

Hi,
I obtained perf stat results for the following benchmark runs.
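(For anyone reproducing this: the counters below are what a bare
'perf stat' prints by default, and the annotated loops further down are
the kind of output 'perf record' followed by 'perf report -n' gives, as
Alexander suggested above. A sketch of the invocations, with
'./benchmark' standing in for the actual run command:

  perf stat ./benchmark          # task-clock, cycles, instructions,
                                 # branches, branch-misses, ...
  perf record -- ./benchmark     # collect samples into perf.data
  perf report -n                 # '-n' adds an absolute sample-count column
  perf list | grep -i branch     # look for branch-prediction related events
)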
-O2:

    7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
              3758      context-switches          #    0.000 K/sec
                40      cpu-migrations            #    0.000 K/sec
             40847      page-faults               #    0.005 K/sec
     7856782413676      cycles                    #    1.000 GHz
     6034510093417      instructions              #    0.77  insn per cycle
      363937274287      branches                  #   46.321 M/sec
       48557110132      branch-misses             #   13.34% of all branches

-O2 with orthonl inlined:

    8319643.114380      task-clock (msec)         #    1.000 CPUs utilized
              4285      context-switches          #    0.001 K/sec
                28      cpu-migrations            #    0.000 K/sec
             40843      page-faults               #    0.005 K/sec
     8319591038295      cycles                    #    1.000 GHz
     6276338800377      instructions              #    0.75  insn per cycle
      467400726106      branches                  #   56.180 M/sec
       45986364011      branch-misses             #    9.84% of all branches

-O2 with orthonl inlined and PRE disabled (this removes the extra branches):

    8207331.088040      task-clock (msec)         #    1.000 CPUs utilized
              2266      context-switches          #    0.000 K/sec
                32      cpu-migrations            #    0.000 K/sec
             40846      page-faults               #    0.005 K/sec
     8207292032467      cycles                    #    1.000 GHz
     6035724436440      instructions              #    0.74  insn per cycle
      364415440156      branches                  #   44.401 M/sec
       53138327276      branch-misses             #   14.58% of all branches

-O2 with orthonl inlined and hoisting disabled:

    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
              3139      context-switches          #    0.000 K/sec
                20      cpu-migrations            #    0.000 K/sec
             40846      page-faults               #    0.005 K/sec
     7797221351467      cycles                    #    1.000 GHz
     6187348757324      instructions              #    0.79  insn per cycle
      461840800061      branches                  #   59.231 M/sec
       26920311761      branch-misses             #    5.83% of all branches

Perf profile for -O2 -fno-code-hoisting and inlined orthonl:
https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data

     3196866 │1f04:   ldur  d1, [x1, #-248]
216348301800 │        add   w0, w0, #0x1
      985098 │        add   x2, x2, #0x18
216215999206 │        add   x1, x1, #0x48
215630376504 │        fmul  d1, d5, d1
863829148015 │        fmul  d1, d1, d6
864228353526 │        fmul  d0, d1, d0
864568163014 │        fmadd d2, d0, d16, d2
             │        cmp   w0, #0x4
216125427594 │      ↓ b.eq  1f34
    15010377 │        ldur  d0, [x2, #-8]
143753737468 │      ↑ b     1f04

-O2 with inlined orthonl:
https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data

359871503840 │1ef8:   ldur  d15, [x1, #-248]
144055883055 │        add   w0, w0, #0x1
 72262104254 │        add   x2, x2, #0x18
143991169721 │        add   x1, x1, #0x48
288648917780 │        fmul  d15, d17, d15
864665644756 │        fmul  d15, d15, d18
863868426387 │        fmul  d14, d15, d14
865228159813 │        fmadd d16, d14, d31, d16
      245967 │        cmp   w0, #0x4
215396760545 │      ↓ b.eq  1f28
   704732365 │        ldur  d14, [x2, #-8]
143775979620 │      ↑ b     1ef8

AFAIU,
(a) Disabling PRE removes the extra branch around the loop, but that
gives only a slight performance improvement (around 1.3%).
(b) Disabling hoisting brings performance back to (slightly better
than) -O2 without inlining orthonl. The generated code for the loop has
a similar layout to -O2 with inlined orthonl, but uses low-numbered
registers.

Again, not sure if it's relevant, but the load from [x1, #-248] seems
to take much less time with hoisting disabled. I checked whether this
might be an alignment issue, but that does not seem to be the case: in
both cases (with / without hoisting) the address in x1 was properly
aligned, and the two addresses differed by only 32 bytes.

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Alexander
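P.S. For readers who haven't looked at the dumps, a minimal,
hypothetical C sketch (not the actual orthonl code; names are made up)
of the two transformations being discussed: PRE can leave an extra
guard branch around a loop when it moves a partially redundant
expression out of it, and code hoisting pulls a computation that
appears in both arms of a branch up above the branch, lengthening the
live range of its result.

/* 1. PRE: 'a + b' would be recomputed on every iteration, i.e. it is
   partially redundant along the loop back edge.  PRE evaluates it once
   before the loop; since the loop may run zero times, a guard is
   needed: this is the "extra branch" around the loop.  */
double
pre_sketch (double a, double b, const double *p, int n)
{
  double s = 0.0;
  if (n > 0)                /* guard branch inserted around the loop */
    {
      double t = a + b;     /* computed once instead of per iteration */
      for (int i = 0; i < n; i++)
        s += t * p[i];
    }
  return s;
}

/* 2. Code hoisting: 'a * b' occurs in both arms of the conditional, so
   it is hoisted above the branch.  That saves a duplicate computation,
   but the result now stays live across the branch, which can raise
   register pressure in a hot loop.  */
double
hoist_sketch (double a, double b, double p, double q, int c)
{
  double t = a * b;         /* hoisted ahead of the branch */
  return c ? t + p : t + q;
}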