On Fri, Aug 12, 2016 at 9:00 PM, Denys Vlasenko <dvlas...@redhat.com> wrote:
> On 08/12/2016 05:20 PM, Denys Vlasenko wrote:
>>> Yes, I know all that. Fetching is one thing. The loop cache, for
>>> instance, is another (more important) thing. Not aligning the loop
>>> head increases the chance of the whole loop being split over more
>>> cache lines than necessary. Jump predictors also don't necessarily
>>> decode/remember the whole instruction address. And so on.
>>>
>>>> Aligning to 8 bytes within a cacheline does not speed things up. It
>>>> simply wastes bytes without speeding up anything.
>>>
>>> It's not that easy, which is why I asked whether you have _measured_
>>> your theory that it doesn't matter. All the alignment adjustments in
>>> GCC were included after measurements. In particular, align-by-8-always
>>> (for loop heads) was included after some large regressions on cpu2000
>>> in 2007 (Core 2 Duo at that time).
>>>
>>> So I'm never much thrilled about listing reasons why performance can't
>>> possibly be affected, especially when we know that it once _was_
>>> affected, and when there's an easy way to show whether it is.
>>
>> z.S:
>>
>> # compile with: gcc -nostartfiles -nostdlib
>> _start: .globl _start
>>         .p2align 8
>>         mov     $4000*1000*1000, %eax   # 5-byte insn
>>         nop                             # 6
>>         nop                             # 7
>>         nop                             # 8
>> loop:   dec     %eax
>>         lea     (%ebx), %ebx
>>         jnz     loop
>>         push    $0
>>         ret                             # SEGV
>>
>> This program loops 4 billion times, then exits (by crashing).
> ...
>> Looks like loop alignment to 8 bytes does not matter (in this
>> particular example).
>
> I looked into it more. I read Agner Fog's microarchitecture manual:
> http://www.agner.org/optimize/microarchitecture.pdf
>
> Since Nehalem, Intel CPUs have a loopback buffer, implemented
> differently in different CPUs.
> I use the following code, a 4-billion-iteration loop, with various
> numbers of padding NOPs:
>
> 0000000000400100 <_start>:
>   400100:  b8 00 28 6b ee        mov    $0xee6b2800,%eax
>   400105:  90                    nop
>   400106:  90                    nop
> 0000000000400107 <loop>:
>   400107:  ff c8                 dec    %eax
>   400109:  8d 88 d2 04 00 00     lea    0x4d2(%rax),%ecx
>   40010f:  75 f6                 jne    400107 <loop>
>
>   400111:  b8 e7 00 00 00        mov    $0xe7,%eax
>   400116:  0f 05                 syscall
>
> On Skylake, the loop slows down if its body crosses a 16-byte boundary
> (as shown above: the last JNE insn doesn't fit).
>
> With the loop starting at 0000000000400106 and fitting into an aligned
> 16-byte block:
>
>  Performance counter stats for './z6' (10 runs):
>
>        1209.051244  task-clock (msec)  #    0.999 CPUs utilized    ( +-  0.99% )
>                  5  context-switches   #    0.004 K/sec            ( +- 11.11% )
>                  2  page-faults        #    0.002 K/sec            ( +-  4.76% )
>      4,101,694,215  cycles             #    3.392 GHz              ( +-  0.51% )
>     12,027,931,896  instructions       #    2.93  insn per cycle   ( +-  0.00% )
>      4,005,295,446  branches           # 3312.759 M/sec            ( +-  0.00% )
>             15,828  branch-misses      #    0.00% of all branches  ( +-  4.49% )
>
>        1.209910890 seconds time elapsed                            ( +-  0.99% )
>
> With the loop starting at 0000000000400107:
>
>  Performance counter stats for './z7' (10 runs):
>
>        1408.362422  task-clock (msec)  #    0.999 CPUs utilized    ( +-  1.23% )
>                  5  context-switches   #    0.004 K/sec            ( +- 15.59% )
>                  2  page-faults        #    0.001 K/sec            ( +-  4.76% )
>      4,749,031,319  cycles             #    3.372 GHz              ( +-  0.34% )
>     12,032,488,082  instructions       #    2.53  insn per cycle   ( +-  0.00% )
>      4,006,159,536  branches           # 2844.552 M/sec            ( +-  0.00% )
>              6,946  branch-misses      #    0.00% of all branches  ( +-  3.88% )
>
>        1.409459099 seconds time elapsed                            ( +-  1.23% )
>
> With the loop starting at 0000000000400108:
>
>  Performance counter stats for './z8' (10 runs):
>
>        1407.127953  task-clock (msec)  #    0.999 CPUs utilized    ( +-  1.09% )
>                  6  context-switches   #    0.004 K/sec            ( +- 15.70% )
>                  2  page-faults        #    0.002 K/sec            ( +-  6.64% )
>      4,747,410,967  cycles             #    3.374 GHz              ( +-  0.39% )
>     12,032,462,223  instructions       #    2.53  insn per cycle   ( +-  0.00% )
>      4,006,154,637  branches           # 2847.044 M/sec            ( +-  0.00% )
>              7,324  branch-misses      #    0.00% of all branches  ( +-  3.40% )
>
>        1.408205377 seconds time elapsed                            ( +-  1.08% )
>
> The difference is significant and reproducible.
>
> Thus, for this CPU, aligning loops to 8 bytes is wrong: it helps when
> it happens to align a loop to 16 bytes, but it can hurt performance
> when it happens to align a loop to 16+8 bytes and thereby pushes the
> end of the loop body over the next 16-byte boundary, as happens in the
> example above.
>
> I suspect something similar was seen some time ago on a different,
> earlier CPU, and on _that_ CPU the decoder/loop-buffer idiosyncrasies
> are such that it likes 8-byte alignment.
>
> It's not true that such alignment is always a win.
It looks to me that all you want is to drop the 8-byte alignment on entities
that are smaller than a cacheline. So you should implement that, rather than
dropping the 8-byte alignment on every entity, even those larger than a
cacheline. In fact, a new
'.align-to-8-or-to-make-N-bytes-fit-into-the-current-cacheline' directive
may help here. Of course the compiler then needs to compute N, or express
it via labels.

Richard.