On 08/12/2016 04:28 PM, Michael Matz wrote:
Hi,
On Fri, 12 Aug 2016, Denys Vlasenko wrote:
Have you tested the performance impact of your patch? Note that the
macro you changed is used for function and code label alignment. So,
unless I misunderstand something that means that if the large
alignment can't be achieved for e.g. a loop start label, you won't
align it at all anymore. This should be fairly catastrophic for any
loopy benchmark, so anything like this would have to be checked on a
couple benchmarks from cpu2000 (possibly cpu2006), which has some that
are extremely alignment sensitive.
Even for function labels I'd find no alignment at all strange, and I
don't see why you'd want this.
For many generations now, x86 CPUs have at least 32, and usually 64 byte
cachelines. Decoders fetch instructions in blocks of 32 or 64 bytes. Not
less. Instructions which are "misaligned" (for example, starting at byte
5) within a cacheline but still fitting into one cacheline are fetched
in one go, with no penalty.
Yes, I know all that. Fetching is one thing. The loop cache, for instance,
is another (more important) thing. Not aligning the loop head increases the
chance of the whole loop being split over more cache lines than necessary.
Jump predictors also don't necessarily decode/remember the whole
instruction address. And so on.
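To make the cache-line argument concrete, here is a minimal Python sketch (an illustration, not something from the thread) that counts how many lines a loop body touches and how often a given alignment lets it straddle a line boundary. The 64-byte line size, the helper names and the example loop lengths are all assumptions; real front ends have further effects (loop buffers, predictors) that this does not model.

```python
def lines_touched(start, length, line_size=64):
    """Number of line_size-byte blocks covered by [start, start + length)."""
    first = start // line_size
    last = (start + length - 1) // line_size
    return last - first + 1


def split_fraction(length, align, line_size=64):
    """Fraction of allowed start offsets where the loop spans more
    lines than the minimum it could possibly occupy."""
    minimum = -(-length // line_size)  # ceiling division
    offsets = range(0, line_size, align)
    split = sum(
        1 for off in offsets
        if lines_touched(off, length, line_size) > minimum
    )
    return split / len(offsets)


if __name__ == "__main__":
    # Hypothetical loop sizes in bytes; alignment 1 means "no alignment".
    for length in (7, 20):
        for align in (1, 8, 16, 64):
            print(length, align, split_fraction(length, align))
```

Under these assumptions a very short loop can never straddle a line once it is 8-byte aligned, while a longer one still can; whether that costs anything measurable is exactly what has to be benchmarked.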
Aligning to 8 bytes within a cacheline does not speed things up. It
simply wastes bytes without speeding up anything.
It's not that easy, which is why I asked whether you have _measured_
that your theory of it not mattering is correct. All the alignment
adjustments in GCC were included after measurements. In particular the
align-by-8-always (for loop heads) was included after some large
regressions on cpu2000, in 2007 (core2 duo at that time).
So, I'm never much thrilled about listing reasons for why performance
can't possibly be affected, especially when we know that it once _was_
affected, when there's an easy way to show that it's not affected.
z.S:
# compile with: gcc -nostartfiles -nostdlib
_start: .globl _start
        .p2align 8
        mov     $4000*1000*1000, %eax   # 5-byte insn
        nop                             # 6
        nop                             # 7
        nop                             # 8
loop:   dec     %eax
        lea     (%ebx), %ebx
        jnz     loop
        push    $0
        ret                             # SEGV
This program loops 4 billion times, then exits (by crashing).
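As a rough sanity check on the loop count (a sketch for illustration, not part of the original message), the arithmetic behind the immediate and the expected event counts can be written out in Python. The three-instructions-per-iteration figure is read off the loop body (dec, lea, jnz); the helper name is hypothetical.

```python
ITERATIONS = 4000 * 1000 * 1000   # immediate loaded into %eax
INSNS_PER_ITERATION = 3           # dec, lea, jnz


def expected_counts(iterations=ITERATIONS, per_iter=INSNS_PER_ITERATION):
    """Back-of-the-envelope numbers for the loop alone
    (startup and kernel activity are ignored)."""
    return {
        "fits_in_32_bits": iterations < 2**32,
        "hex": hex(iterations),
        "instructions": iterations * per_iter,
        "branches": iterations,
    }


if __name__ == "__main__":
    for name, value in expected_counts().items():
        print(name, value)
```

The loop alone should therefore account for about 12 billion instructions and 4 billion branches, which is roughly what the perf output further down reports.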
I built two executables from it: z8, as shown above, which has its loop
8-byte aligned:
$ objdump -dr z8

z8:     file format elf64-x86-64

Disassembly of section .text:

0000000000400100 <_start>:
  400100:       b8 00 28 6b ee          mov    $0xee6b2800,%eax
  400105:       90                      nop
  400106:       90                      nop
  400107:       90                      nop

0000000000400108 <loop>:
  400108:       ff c8                   dec    %eax
  40010a:       67 8d 1b                lea    (%ebx),%ebx
  40010d:       75 f9                   jne    400108 <loop>
  40010f:       6a 00                   pushq  $0x0
  400111:       c3                      retq
and z7, which has one NOP removed, so its loop starts
at 0000000000400107.
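For reference, the placement of the two loops can be checked with a few lines of Python (an illustrative sketch, not from the original message). The addresses and instruction lengths are taken from the disassembly above; the 16-byte and 64-byte block sizes are assumptions about which fetch granularities might matter, and the helper name is hypothetical.

```python
Z8_LOOP = 0x400108
Z7_LOOP = 0x400107
LOOP_BYTES = 2 + 3 + 2  # dec (ff c8), lea (67 8d 1b), jne (75 f9)


def describe(addr, length=LOOP_BYTES):
    """Where a loop of `length` bytes starting at `addr` falls."""
    end = addr + length - 1  # address of the loop's last byte
    return {
        "offset_mod_8": addr % 8,
        "same_16_byte_block": addr // 16 == end // 16,
        "same_64_byte_line": addr // 64 == end // 64,
    }


if __name__ == "__main__":
    print("z8", describe(Z8_LOOP))
    print("z7", describe(Z7_LOOP))
```

Both variants keep the whole 7-byte loop inside one 16-byte block, so this test only compares an aligned and a misaligned loop head that do not cross such a boundary; it says nothing about loops that do.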
$ perf stat -r20 ./z7
 Performance counter stats for './z7' (20 runs):

     1204.217409  task-clock (msec)   #    0.972 CPUs utilized    ( +-  0.19% )
              10  context-switches    #    0.009 K/sec            ( +- 15.69% )
               0  cpu-migrations      #    0.000 K/sec            ( +- 77.80% )
               3  page-faults         #    0.003 K/sec            ( +-  2.87% )
   4,220,236,037  cycles              #    3.505 GHz              ( +-  0.20% )
  12,030,574,486  instructions        #    2.85  insn per cycle   ( +-  0.00% )
   4,005,827,208  branches            # 3326.498 M/sec            ( +-  0.00% )
          22,338  branch-misses       #    0.00% of all branches  ( +-  4.10% )

     1.238638386 seconds time elapsed                             ( +-  0.19% )
$ perf stat -r20 ./z8
 Performance counter stats for './z8' (20 runs):

     1203.453938  task-clock (msec)   #    0.973 CPUs utilized    ( +-  0.27% )
               8  context-switches    #    0.007 K/sec            ( +- 14.46% )
               0  cpu-migrations      #    0.000 K/sec            ( +- 54.61% )
               3  page-faults         #    0.003 K/sec            ( +-  2.60% )
   4,233,994,227  cycles              #    3.518 GHz              ( +-  0.27% )
  12,030,085,275  instructions        #    2.84  insn per cycle   ( +-  0.00% )
   4,005,715,106  branches            # 3328.516 M/sec            ( +-  0.00% )
          21,486  branch-misses       #    0.00% of all branches  ( +-  4.42% )

     1.236360951 seconds time elapsed                             ( +-  0.26% )
z8 is 0.2% faster. Let's try another run?
 Performance counter stats for './z7' (20 runs):

     1217.476778  task-clock (msec)   #    0.972 CPUs utilized    ( +-  0.30% )
               8  context-switches    #    0.006 K/sec            ( +- 10.98% )
               0  cpu-migrations      #    0.000 K/sec            ( +- 27.14% )
               3  page-faults         #    0.003 K/sec            ( +-  3.06% )
   4,252,346,035  cycles              #    3.493 GHz              ( +-  0.17% )
  12,030,474,923  instructions        #    2.83  insn per cycle   ( +-  0.00% )
   4,005,793,752  branches            # 3290.242 M/sec            ( +-  0.00% )
          22,640  branch-misses       #    0.00% of all branches  ( +-  6.52% )

     1.252268537 seconds time elapsed                             ( +-  0.32% )
 Performance counter stats for './z8' (20 runs):

     1220.024012  task-clock (msec)   #    0.973 CPUs utilized    ( +-  0.35% )
               8  context-switches    #    0.006 K/sec            ( +- 12.55% )
               0  cpu-migrations      #    0.000 K/sec            ( +- 39.74% )
               3  page-faults         #    0.003 K/sec            ( +-  2.87% )
   4,247,690,562  cycles              #    3.482 GHz              ( +-  0.27% )
  12,032,460,554  instructions        #    2.83  insn per cycle   ( +-  0.01% )
   4,006,219,524  branches            # 3283.722 M/sec            ( +-  0.01% )
          26,651  branch-misses       #    0.00% of all branches  ( +-  7.73% )

     1.253366584 seconds time elapsed                             ( +-  0.36% )
Now z7 is 0.1% faster.
Looks like loop alignment to 8 bytes does not matter (in this particular
example).
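The "0.2% faster" and "0.1% faster" figures can be reproduced from the elapsed times above with a short Python sketch (added for illustration; the helper names are hypothetical). It also compares each difference against the larger of the two "+-" values perf reported for that pair, which is a crude stand-in for a proper significance test.

```python
def relative_diff_percent(a, b):
    """How much smaller b is than a, as a percentage of a."""
    return (a - b) / a * 100.0


def within_noise(diff_percent, *noise_percents):
    """True if the difference does not exceed the largest reported +- value."""
    return abs(diff_percent) <= max(noise_percents)


# (elapsed seconds, reported +- percent), copied from the perf output above.
RUNS = [
    {"z7": (1.238638386, 0.19), "z8": (1.236360951, 0.26)},
    {"z7": (1.252268537, 0.32), "z8": (1.253366584, 0.36)},
]

if __name__ == "__main__":
    for number, run in enumerate(RUNS, start=1):
        (t7, n7), (t8, n8) = run["z7"], run["z8"]
        diff = relative_diff_percent(t7, t8)  # positive: z8 was faster
        print(f"run {number}: z8 vs z7 {diff:+.2f}%, "
              f"within noise: {within_noise(diff, n7, n8)}")
```

In both runs the difference is no larger than the reported run-to-run variation, which is consistent with the conclusion that alignment makes no measurable difference for this loop.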