On Fri, Aug 12, 2016 at 9:00 PM, Denys Vlasenko <dvlas...@redhat.com> wrote:
> On 08/12/2016 05:20 PM, Denys Vlasenko wrote:
>>> Yes, I know all that. Fetching is one thing. The loop cache, for
>>> instance, is another (more important) thing. Not aligning the loop
>>> head increases the chance of the whole loop being split over more
>>> cache lines than necessary. Jump predictors also don't necessarily
>>> decode/remember the whole instruction address. And so on.
>>>
>>>> Aligning to 8 bytes within a cacheline does not speed things up. It
>>>> simply wastes bytes without speeding up anything.
>>>
>>> It's not that easy, which is why I asked whether you have _measured_
>>> your theory that it doesn't matter. All the alignment adjustments in
>>> GCC were included after measurements. In particular, align-by-8-always
>>> (for loop heads) was included after some large regressions on cpu2000
>>> in 2007 (Core 2 Duo at that time).
>>>
>>> So I'm never much thrilled about listing reasons why performance can't
>>> possibly be affected, especially when we know that it once _was_
>>> affected, and when there's an easy way to show whether it is.
>>
>> z.S:
>>
>> # compile with: gcc -nostartfiles -nostdlib
>> _start: .globl _start
>>         .p2align 8
>>         mov     $4000*1000*1000, %eax   # 5-byte insn
>>         nop                             # 6
>>         nop                             # 7
>>         nop                             # 8
>> loop:   dec     %eax
>>         lea     (%ebx), %ebx
>>         jnz     loop
>>         push    $0
>>         ret                             # SEGV
>>
>> This program loops 4 billion times, then exits (by crashing).
> ...
>> Looks like loop alignment to 8 bytes does not matter (in this
>> particular example).
>
> I looked into it more. I read Agner Fog's microarchitecture manual:
> http://www.agner.org/optimize/microarchitecture.pdf
>
> Since Nehalem, Intel CPUs have a loopback buffer, implemented
> differently in different CPUs.
> I use the following code, a 4-billion-iteration loop, with various
> numbers of padding NOPs:
>
> 0000000000400100 <_start>:
>   400100:  b8 00 28 6b ee        mov    $0xee6b2800,%eax
>   400105:  90                    nop
>   400106:  90                    nop
> 0000000000400107 <loop>:
>   400107:  ff c8                 dec    %eax
>   400109:  8d 88 d2 04 00 00     lea    0x4d2(%rax),%ecx
>   40010f:  75 f6                 jne    400107 <loop>
>
>   400111:  b8 e7 00 00 00        mov    $0xe7,%eax
>   400116:  0f 05                 syscall
>
> On Skylake, the loop slows down if its body crosses a 16-byte boundary
> (as shown above: the last JNE insn doesn't fit).
>
> With the loop starting at 0000000000400106 and fitting into an aligned
> 16-byte block:
>
>  Performance counter stats for './z6' (10 runs):
>
>        1209.051244  task-clock (msec)  #    0.999 CPUs utilized    ( +-  0.99% )
>                  5  context-switches   #    0.004 K/sec            ( +- 11.11% )
>                  2  page-faults        #    0.002 K/sec            ( +-  4.76% )
>      4,101,694,215  cycles             #    3.392 GHz              ( +-  0.51% )
>     12,027,931,896  instructions       #    2.93  insn per cycle   ( +-  0.00% )
>      4,005,295,446  branches           # 3312.759 M/sec            ( +-  0.00% )
>             15,828  branch-misses      #    0.00% of all branches  ( +-  4.49% )
>
>        1.209910890 seconds time elapsed                            ( +-  0.99% )
>
> With the loop starting at 0000000000400107:
>
>  Performance counter stats for './z7' (10 runs):
>
>        1408.362422  task-clock (msec)  #    0.999 CPUs utilized    ( +-  1.23% )
>                  5  context-switches   #    0.004 K/sec            ( +- 15.59% )
>                  2  page-faults        #    0.001 K/sec            ( +-  4.76% )
>      4,749,031,319  cycles             #    3.372 GHz              ( +-  0.34% )
>     12,032,488,082  instructions       #    2.53  insn per cycle   ( +-  0.00% )
>      4,006,159,536  branches           # 2844.552 M/sec            ( +-  0.00% )
>              6,946  branch-misses      #    0.00% of all branches  ( +-  3.88% )
>
>        1.409459099 seconds time elapsed                            ( +-  1.23% )
>
> With the loop starting at 0000000000400108:
>
>  Performance counter stats for './z8' (10 runs):
>
>        1407.127953  task-clock (msec)  #    0.999 CPUs utilized    ( +-  1.09% )
>                  6  context-switches   #    0.004 K/sec            ( +- 15.70% )
>                  2  page-faults        #    0.002 K/sec            ( +-  6.64% )
>      4,747,410,967  cycles             #    3.374 GHz              ( +-  0.39% )
>     12,032,462,223  instructions       #    2.53  insn per cycle   ( +-  0.00% )
>      4,006,154,637  branches           # 2847.044 M/sec            ( +-  0.00% )
>              7,324  branch-misses      #    0.00% of all branches  ( +-  3.40% )
>
>        1.408205377 seconds time elapsed                            ( +-  1.08% )
>
> The difference is significant and reproducible.
>
> Thus, for this CPU, aligning loops to 8 bytes is wrong: it helps when
> it happens to align a loop to 16 bytes, but it can hurt performance
> when it happens to align a loop to 16+8 bytes and thereby pushes the
> end of the loop body over the next 16-byte boundary, as happens in the
> example above.
>
> I suspect something similar was seen some time ago on a different,
> earlier CPU, and on _that_ CPU the decoder/loop-buffer idiosyncrasies
> are such that it likes 8-byte alignment.
>
> It's not true that such alignment is always a win.
It looks to me that all you want is to drop the 8-byte alignment on entities
that are smaller than a cacheline. So you should implement that, rather than
dropping the 8-byte alignment on every entity, even those larger than a
cacheline. In fact, a new
'.align-to-8-or-to-make-N-bytes-fit-into-the-current-cacheline' directive
may help here. Of course the compiler then needs to compute N, or express
it via labels.

Richard.