On 08/12/2016 04:28 PM, Michael Matz wrote:
Hi,

On Fri, 12 Aug 2016, Denys Vlasenko wrote:

Have you tested the performance impact of your patch?  Note that the
macro you changed is used for function and code label alignment.  So,
unless I misunderstand something that means that if the large
alignment can't be achieved for e.g. a loop start label, you won't
align it at all anymore. This should be fairly catastrophic for any
loopy benchmark, so anything like this would have to be checked on a
couple benchmarks from cpu2000 (possibly cpu2006), which has some that
are extremely alignment sensitive.

Even for function labels I'd find no alignment at all strange, and I
don't see why you'd want this.

For many generations now, x86 CPUs have at least 32, and usually 64 byte
cachelines. Decoders fetch instructions in blocks of 32 or 64 bytes. Not
less. Instructions which are "misaligned" (for example, starting at byte
5) within a cacheline but still fitting into one cacheline are fetched
in one go, with no penalty.

Yes, I know all that.  Fetching is one thing.  Loop cache is for instance
another (more important) thing.  Not aligning the loop head increases
chance of the whole loop being split over more cache lines than necessary.
Jump predictors also don't necessarily decode/remember the whole
instruction address.  And so on.

Aligning to 8 bytes within a cacheline does not speed things up. It
simply wastes bytes without speeding up anything.

It's not that easy, which is why I have asked if you have _measured_ the
correctness of your theory of it not mattering?  All the alignment
adjustments in GCC were included after measurements.  In particular the
align-by-8-always (for loop heads) was included after some large
regressions on cpu2000, in 2007 (core2 duo at that time).

So, I'm never much thrilled about listing reasons for why performance
can't possibly be affected, especially when we know that it once _was_
affected, when there's an easy way to show that it's not affected.

z.S:

#compile with: gcc -nostartfiles -nostdlib
_start:         .globl _start
                .p2align 8
                mov     $4000*1000*1000, %eax # 5-byte insn
                nop     # 6
                nop     # 7
                nop     # 8
loop:           dec     %eax
                lea     (%ebx), %ebx
                jnz     loop
                push    $0
                ret     # SEGV

This program loops 4 billion times, then exits (by crashing).
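
As a quick sanity check of the constant used above, the following sketch (Python, purely illustrative and not part of the test case) confirms that the immediate fits in a 32-bit register and works out how many instructions and branches the loop itself should retire. The three-instructions-per-iteration figure is read off the listing (dec, lea, jnz); the variable names are made up for this sketch.

```python
# Sanity check of the loop constant from z.S and the event counts it implies.
ITERATIONS = 4000 * 1000 * 1000   # immediate loaded into %eax
INSNS_PER_ITERATION = 3           # dec, lea, jnz (read off the listing)

fits_in_32_bits = ITERATIONS < 2**32          # must fit in %eax
immediate_hex = hex(ITERATIONS)               # value objdump should display
expected_loop_insns = ITERATIONS * INSNS_PER_ITERATION
expected_loop_branches = ITERATIONS           # one jnz per iteration

print(fits_in_32_bits, immediate_hex, expected_loop_insns, expected_loop_branches)
```

The measured counts should come out slightly above these figures, since the few instructions outside the loop and any kernel-side activity are counted as well.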

I built two executables from it: z8, as shown above, which has its loop
8-byte aligned:

$ objdump -dr z8
z8:     file format elf64-x86-64
Disassembly of section .text:
0000000000400100 <_start>:
  400100:       b8 00 28 6b ee          mov    $0xee6b2800,%eax
  400105:       90                      nop
  400106:       90                      nop
  400107:       90                      nop
0000000000400108 <loop>:
  400108:       ff c8                   dec    %eax
  40010a:       67 8d 1b                lea    (%ebx),%ebx
  40010d:       75 f9                   jne    400108 <loop>
  40010f:       6a 00                   pushq  $0x0
  400111:       c3                      retq

and z7, which has one NOP removed and therefore its loop starts
at 0000000000400107.
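
To make the layout of the two variants explicit, here is a small illustrative sketch (Python, not part of the test itself). It assumes a 64-byte cache line, takes the loop start addresses from the text above, and takes the 7-byte loop body size from the objdump encodings (2 + 3 + 2 bytes). The helper name `describe` is hypothetical.

```python
# Where does the loop body land relative to 8-byte and cache-line boundaries?
CACHE_LINE = 64            # assumed cache line size in bytes
LOOP_BYTES = 2 + 3 + 2     # dec (ff c8), lea (67 8d 1b), jne (75 f9)

def describe(start):
    """Return alignment facts for a loop body starting at address `start`."""
    end = start + LOOP_BYTES - 1          # address of the last byte of the loop
    return {
        "offset_mod_8": start % 8,
        "offset_in_line": start % CACHE_LINE,
        "single_line": start // CACHE_LINE == end // CACHE_LINE,
    }

z8 = describe(0x400108)    # loop start in z8
z7 = describe(0x400107)    # loop start in z7 (one NOP removed)

print("z8:", z8)
print("z7:", z7)
```

Under that assumption both loops sit entirely inside one cache line; the only difference is whether the loop head is 8-byte aligned.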


$ perf stat -r20 ./z7
 Performance counter stats for './z7' (20 runs):
       1204.217409      task-clock (msec)         #    0.972 CPUs utilized            ( +-  0.19% )
                10      context-switches          #    0.009 K/sec                    ( +- 15.69% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +- 77.80% )
                 3      page-faults               #    0.003 K/sec                    ( +-  2.87% )
     4,220,236,037      cycles                    #    3.505 GHz                      ( +-  0.20% )
    12,030,574,486      instructions              #    2.85  insn per cycle           ( +-  0.00% )
     4,005,827,208      branches                  # 3326.498 M/sec                    ( +-  0.00% )
            22,338      branch-misses             #    0.00% of all branches          ( +-  4.10% )

       1.238638386 seconds time elapsed                                          ( +-  0.19% )

$ perf stat -r20 ./z8
 Performance counter stats for './z8' (20 runs):
       1203.453938      task-clock (msec)         #    0.973 CPUs utilized            ( +-  0.27% )
                 8      context-switches          #    0.007 K/sec                    ( +- 14.46% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +- 54.61% )
                 3      page-faults               #    0.003 K/sec                    ( +-  2.60% )
     4,233,994,227      cycles                    #    3.518 GHz                      ( +-  0.27% )
    12,030,085,275      instructions              #    2.84  insn per cycle           ( +-  0.00% )
     4,005,715,106      branches                  # 3328.516 M/sec                    ( +-  0.00% )
            21,486      branch-misses             #    0.00% of all branches          ( +-  4.42% )

       1.236360951 seconds time elapsed                                          ( +-  0.26% )


z8 is 0.2% faster. Let's try another run.


 Performance counter stats for './z7' (20 runs):

       1217.476778      task-clock (msec)         #    0.972 CPUs utilized            ( +-  0.30% )
                 8      context-switches          #    0.006 K/sec                    ( +- 10.98% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +- 27.14% )
                 3      page-faults               #    0.003 K/sec                    ( +-  3.06% )
     4,252,346,035      cycles                    #    3.493 GHz                      ( +-  0.17% )
    12,030,474,923      instructions              #    2.83  insn per cycle           ( +-  0.00% )
     4,005,793,752      branches                  # 3290.242 M/sec                    ( +-  0.00% )
            22,640      branch-misses             #    0.00% of all branches          ( +-  6.52% )

       1.252268537 seconds time elapsed                                          ( +-  0.32% )

 Performance counter stats for './z8' (20 runs):

       1220.024012      task-clock (msec)         #    0.973 CPUs utilized            ( +-  0.35% )
                 8      context-switches          #    0.006 K/sec                    ( +- 12.55% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +- 39.74% )
                 3      page-faults               #    0.003 K/sec                    ( +-  2.87% )
     4,247,690,562      cycles                    #    3.482 GHz                      ( +-  0.27% )
    12,032,460,554      instructions              #    2.83  insn per cycle           ( +-  0.01% )
     4,006,219,524      branches                  # 3283.722 M/sec                    ( +-  0.01% )
            26,651      branch-misses             #    0.00% of all branches          ( +-  7.73% )

       1.253366584 seconds time elapsed                                          ( +-  0.36% )


Now z7 is 0.1% faster.

Looks like loop alignment to 8 bytes does not matter (in this particular 
example).
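
For reference, the percentages quoted above can be reproduced from the elapsed times in the perf output. The sketch below (Python, illustrative only) uses the four "seconds time elapsed" figures and the matching "+-" percentages exactly as reported; the dictionary layout and variable names are made up for this sketch.

```python
# Relative differences between z7 and z8, from the reported elapsed times.
run1 = {"z7": 1.238638386, "z8": 1.236360951}   # seconds, first pair of runs
run2 = {"z7": 1.252268537, "z8": 1.253366584}   # seconds, second pair of runs

# Reported run-to-run variation of the elapsed time, in percent.
noise1 = {"z7": 0.19, "z8": 0.26}
noise2 = {"z7": 0.32, "z8": 0.36}

# How much faster the quicker binary was, as a percentage of the slower one.
run1_z8_gain = (run1["z7"] - run1["z8"]) / run1["z7"] * 100
run2_z7_gain = (run2["z8"] - run2["z7"]) / run2["z8"] * 100

print(f"run 1: z8 faster by {run1_z8_gain:.2f}%")
print(f"run 2: z7 faster by {run2_z7_gain:.2f}%")
```

Both differences are no larger than the variation perf reports for the same runs, and they point in opposite directions, which is consistent with the conclusion that the two binaries are indistinguishable in this test.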
