https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69622
Bug ID: 69622
Summary: compiler reordering of non-temporal (write-combining) stores produces significant performance hit
Product: gcc
Version: 5.3.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: i386-linux-gnu, x86_64-linux-gnu

IDK whether to mark this as "target" or something else.  Other architectures might have similar write-combining stores that are sensitive to writing whole cache lines at once.

For background, see this SO question:
http://stackoverflow.com/questions/25778302/wrong-gcc-generated-assembly-ordering-results-in-performance-hit

In an unrolled copy loop, gcc decides to emit vmovntdq stores in a different order than they appear in the source.  There's no correctness issue, but the number of fill buffers is very limited (maybe each core has 10 or so?), so it's *much* better to write all of one cache line, then all of the next cache line.  See my answer on that SO question for lots of discussion and links.

The poster of that question got a 33% speedup (from ~10.2M packets per second to ~13.3M packets per second) by putting the loads and stores in source order in the binary.  (Unknown hardware and surrounding code, but presumably this loop is *the* bottleneck in his app.)  Anyway, real numbers show that this isn't just a theoretical argument that some code would be better.

Compilable test case that demonstrates the issue:

#include <stdint.h>
#include <immintrin.h>

//#define compiler_writebarrier()   __asm__ __volatile__ ("")
#define compiler_writebarrier()   // empty.

void copy_mcve(void *const destination, const void *const source, const size_t bytes)
{
          __m256i *dst      = destination;
    const __m256i *src      = source;
    const __m256i *dst_endp = (destination + bytes);

    while (dst < dst_endp)  {
        __m256i m0 = _mm256_load_si256( src + 0 );
        __m256i m1 = _mm256_load_si256( src + 1 );
        __m256i m2 = _mm256_load_si256( src + 2 );
        __m256i m3 = _mm256_load_si256( src + 3 );

        _mm256_stream_si256( dst+0, m0 );
        compiler_writebarrier();   // even one anywhere in the loop is enough for current gcc
        _mm256_stream_si256( dst+1, m1 );
        compiler_writebarrier();
        _mm256_stream_si256( dst+2, m2 );
        compiler_writebarrier();
        _mm256_stream_si256( dst+3, m3 );
        compiler_writebarrier();

        src += 4;
        dst += 4;
    }
}

With the barriers defined as a no-op, this compiles (gcc 5.3.0 -O3 -march=haswell: http://goo.gl/CwtpS7) to:

copy_mcve:
        addq    %rdi, %rdx
        cmpq    %rdx, %rdi
        jnb     .L7
.L5:
        vmovdqa 32(%rsi), %ymm2
        subq    $-128, %rdi
        subq    $-128, %rsi
        vmovdqa -64(%rsi), %ymm1
        vmovdqa -32(%rsi), %ymm0
        vmovdqa -128(%rsi), %ymm3
        # If dst is aligned, the four halves of two cache lines are {A B} {C D}:
        vmovntdq        %ymm2, -96(%rdi)     # B
        vmovntdq        %ymm1, -64(%rdi)     # C
        vmovntdq        %ymm0, -32(%rdi)     # D
        vmovntdq        %ymm3, -128(%rdi)    # A
        cmpq    %rdi, %rdx
        ja      .L5
        vzeroupper
.L7:
        ret

If the output buffer is aligned, that B C D A store ordering maximally separates the two halves of the first cache line, giving the most opportunity for partially-full fill buffers to get flushed.

Doing the +32 load first makes no sense with that placement of the pointer-increment instructions.  Doing the +0 load first could save a byte of code size by not needing a displacement byte.  I'm guessing that's what one optimizer pass was going for when it put the subs there, but then something else came along and re-ordered the loads.  Is there something that tries to touch both cache lines as early as possible, to trigger the loads?  (Assuming the buffer is 64B-aligned.)
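For reference, the workaround the test case is set up to demonstrate looks like this when the barrier is actually enabled.  Just a sketch (the function name is mine): it's the same loop as copy_mcve, but with compiler_writebarrier() defined as the empty asm statement, and only a single barrier left in, since even one anywhere in the loop is enough for current gcc.  The empty asm emits no instructions; it only stops the compiler from reordering the stores across it.

#include <stddef.h>
#include <immintrin.h>

// Same loop as copy_mcve above, but with the write barrier defined as an
// empty asm statement instead of a no-op macro.  One barrier in the loop
// body is enough to keep current gcc from reordering the vmovntdq stores.
#define compiler_writebarrier()   __asm__ __volatile__ ("")

void copy_with_barrier(void *const destination, const void *const source, const size_t bytes)
{
          __m256i *dst      = destination;
    const __m256i *src      = source;
    const __m256i *dst_endp = (const __m256i *)((const char *)destination + bytes);

    while (dst < dst_endp)  {
        __m256i m0 = _mm256_load_si256( src + 0 );
        __m256i m1 = _mm256_load_si256( src + 1 );
        __m256i m2 = _mm256_load_si256( src + 2 );
        __m256i m3 = _mm256_load_si256( src + 3 );

        _mm256_stream_si256( dst+0, m0 );
        compiler_writebarrier();        // emits no code; just blocks compiler reordering
        _mm256_stream_si256( dst+1, m1 );
        _mm256_stream_si256( dst+2, m2 );
        _mm256_stream_si256( dst+3, m3 );

        src += 4;
        dst += 4;
    }
}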
Doing the subs after the last store would save another insn byte, because one of the stores could use an empty displacement as well.  That's where clang puts the pointer increments (and it keeps the loads and stores in source order).  clang also uses vmovaps / vmovntps; that's probably a holdover from saving an insn byte in the non-VEX encoding of the 128b insn, but it does make the output work with AVX1 instead of requiring AVX2.

Using a 2-register addressing mode for the loads could save a sub instruction inside the loop: increment dst normally, but reference src with a 2-register addressing mode using dst and a register initialized with src-dst.  (In the godbolt link, uncomment the #define ADDRESSING_MODE_HACK.  With ugly enough source, gcc can be bludgeoned into making code like that; see the rough C sketch at the end of this comment.  It wastes insns in the intro, though, apparently to avoid 3-component (base+index+disp) addresses.  I've been meaning to check whether that's a factor.)

2-register addressing modes don't seem to micro-fuse in the pipeline on SnB-family CPUs (http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes), but Agner Fog's testing methods found that they do micro-fuse.  There's a theory (discussed in a thread on his blog) that they micro-fuse in the uop cache only.  Anyway, way off topic, sorry.
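In case the godbolt link rots, the clean C shape of that 2-register-addressing-mode idea looks roughly like the sketch below (hypothetical name copy_idx, not the exact ADDRESSING_MODE_HACK code from the link).  No guarantee gcc actually picks base+index addressing for the loads from source this clean; as noted above, it currently takes uglier source to bludgeon it into that.

#include <stddef.h>
#include <immintrin.h>

// Sketch of the 2-register addressing-mode idea: only dst is incremented in
// the loop, and src is recomputed as dst + (src - dst), so the loads *can*
// use a base+index addressing mode and one sub drops out of the loop body.
// Names and exact shape are illustrative, not the code from the godbolt link.
void copy_idx(void *const destination, const void *const source, const size_t bytes)
{
    char *dst            = destination;
    char *const dst_endp = dst + bytes;
    const ptrdiff_t src_minus_dst = (const char *)source - (const char *)destination;

    while (dst < dst_endp)  {
        const __m256i *src = (const __m256i *)(dst + src_minus_dst);
        __m256i m0 = _mm256_load_si256( src + 0 );
        __m256i m1 = _mm256_load_si256( src + 1 );
        __m256i m2 = _mm256_load_si256( src + 2 );
        __m256i m3 = _mm256_load_si256( src + 3 );

        _mm256_stream_si256( (__m256i *)dst + 0, m0 );
        _mm256_stream_si256( (__m256i *)dst + 1, m1 );
        _mm256_stream_si256( (__m256i *)dst + 2, m2 );
        _mm256_stream_si256( (__m256i *)dst + 3, m3 );

        dst += 4 * sizeof(__m256i);   // 128 bytes per iteration
    }
}

(Subtracting pointers into two different buffers isn't strictly portable C, but it expresses the register setup described above: one incrementing pointer plus a constant src-dst offset.)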