https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69622
Bug ID: 69622
Summary: compiler reordering of non-temporal (write-combining) stores produces significant performance hit
Product: gcc
Version: 5.3.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: i386-linux-gnu, x86_64-linux-gnu

IDK whether to mark this as "target" or something else.  Other architectures might have similar write-combining stores that are sensitive to writing whole cache lines at once.

For background, see this SO question:
http://stackoverflow.com/questions/25778302/wrong-gcc-generated-assembly-ordering-results-in-performance-hit

In an unrolled copy loop, gcc decides to emit vmovntdq stores in a different order than they appear in the source.  There's no correctness issue, but the number of fill buffers is very limited (maybe each core has 10 or so?), so it's *much* better to write all of one cache line, then all of the next cache line.  See my answer on that SO question for lots of discussion and links.

The poster of that question got a 33% speedup (from ~10.2M packets per second to ~13.3M packets per second) by putting the loads and stores in source order in the binary.  (Unknown hardware and surrounding code, but presumably this loop is *the* bottleneck in his app.)  Anyway, real numbers show that this isn't just a theoretical argument that some code would be better.

Compilable test case that demonstrates the issue:

#include <stdint.h>
#include <immintrin.h>

//#define compiler_writebarrier()   __asm__ __volatile__ ("")
#define compiler_writebarrier()   // empty.

void copy_mcve(void *const destination, const void *const source, const size_t bytes)
{
          __m256i *dst      = destination;
    const __m256i *src      = source;
    const __m256i *dst_endp = (destination + bytes);

    while (dst < dst_endp)  {
        __m256i m0 = _mm256_load_si256( src + 0 );
        __m256i m1 = _mm256_load_si256( src + 1 );
        __m256i m2 = _mm256_load_si256( src + 2 );
        __m256i m3 = _mm256_load_si256( src + 3 );

        _mm256_stream_si256( dst+0, m0 );
        compiler_writebarrier();   // even one anywhere in the loop is enough for current gcc
        _mm256_stream_si256( dst+1, m1 );
        compiler_writebarrier();
        _mm256_stream_si256( dst+2, m2 );
        compiler_writebarrier();
        _mm256_stream_si256( dst+3, m3 );
        compiler_writebarrier();

        src += 4;
        dst += 4;
    }
}

With the barriers defined as a no-op, this compiles (gcc 5.3.0 -O3 -march=haswell: http://goo.gl/CwtpS7) to:

copy_mcve:
        addq    %rdi, %rdx
        cmpq    %rdx, %rdi
        jnb     .L7
.L5:
        vmovdqa 32(%rsi), %ymm2
        subq    $-128, %rdi
        subq    $-128, %rsi
        vmovdqa -64(%rsi), %ymm1
        vmovdqa -32(%rsi), %ymm0
        vmovdqa -128(%rsi), %ymm3
        # If dst is aligned, the four halves of two cache lines are {A B} {C D}:
        vmovntdq        %ymm2, -96(%rdi)     # B
        vmovntdq        %ymm1, -64(%rdi)     # C
        vmovntdq        %ymm0, -32(%rdi)     # D
        vmovntdq        %ymm3, -128(%rdi)    # A
        cmpq    %rdi, %rdx
        ja      .L5
        vzeroupper
.L7:
        ret

If the output buffer is aligned, that B C D A store ordering maximally separates the two halves of the first cache line, giving the most opportunity for partially-full fill buffers to get flushed.

Doing the +32 load first makes no sense with that placement of the pointer-increment instructions.  Doing the +0 load first could save a byte of code size by not needing a displacement byte.  I'm guessing that's what one optimizer pass was going for when it put the subs there, but then something else came along and re-ordered the loads.  Is there something that tries to touch both cache lines as early as possible, to trigger the loads?  (Assuming the buffer is 64B-aligned.)
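For reference, the workaround the test case is set up to demonstrate looks like this when the barrier is actually enabled.  Just a sketch (the function name is mine): it's the same loop as copy_mcve, but with compiler_writebarrier() defined as the empty asm statement, and only a single barrier left in, since even one anywhere in the loop is enough for current gcc.  The empty asm emits no instructions; it only stops the compiler from reordering the stores across it.

#include <stddef.h>
#include <immintrin.h>

// Same loop as copy_mcve above, but with the write barrier defined as an
// empty asm statement instead of a no-op macro.  One barrier in the loop
// body is enough to keep current gcc from reordering the vmovntdq stores.
#define compiler_writebarrier()   __asm__ __volatile__ ("")

void copy_with_barrier(void *const destination, const void *const source, const size_t bytes)
{
          __m256i *dst      = destination;
    const __m256i *src      = source;
    const __m256i *dst_endp = (const __m256i *)((const char *)destination + bytes);

    while (dst < dst_endp)  {
        __m256i m0 = _mm256_load_si256( src + 0 );
        __m256i m1 = _mm256_load_si256( src + 1 );
        __m256i m2 = _mm256_load_si256( src + 2 );
        __m256i m3 = _mm256_load_si256( src + 3 );

        _mm256_stream_si256( dst+0, m0 );
        compiler_writebarrier();        // emits no code; just blocks compiler reordering
        _mm256_stream_si256( dst+1, m1 );
        _mm256_stream_si256( dst+2, m2 );
        _mm256_stream_si256( dst+3, m3 );

        src += 4;
        dst += 4;
    }
}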
Doing the subs after the last store would save another insn byte, because one of the stores could use an empty displacement as well.  That's where clang puts the pointer increments (and it keeps the loads and stores in source order).  clang also uses vmovaps / vmovntps; that's probably a holdover from saving an insn byte in the non-VEX encoding of the 128b insn, but it does make the output work with AVX1 instead of requiring AVX2.

Using a 2-register addressing mode for the loads could save a sub instruction inside the loop: increment dst normally, but reference src with a 2-register addressing mode using dst and a register initialized with src-dst.  (In the godbolt link, uncomment the #define ADDRESSING_MODE_HACK.  With ugly enough source, gcc can be bludgeoned into making code like that; see the rough C sketch at the end of this comment.  It wastes insns in the intro, though, apparently to avoid 3-component (base+index+disp) addresses.  I've been meaning to check whether that's a factor.)

2-register addressing modes don't seem to micro-fuse in the pipeline on SnB-family CPUs (http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes), but Agner Fog's testing methods found that they do micro-fuse.  There's a theory (discussed in a thread on his blog) that they micro-fuse in the uop cache only.  Anyway, way off topic, sorry.
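In case the godbolt link rots, the clean C shape of that 2-register-addressing-mode idea looks roughly like the sketch below (hypothetical name copy_idx, not the exact ADDRESSING_MODE_HACK code from the link).  No guarantee gcc actually picks base+index addressing for the loads from source this clean; as noted above, it currently takes uglier source to bludgeon it into that.

#include <stddef.h>
#include <immintrin.h>

// Sketch of the 2-register addressing-mode idea: only dst is incremented in
// the loop, and src is recomputed as dst + (src - dst), so the loads *can*
// use a base+index addressing mode and one sub drops out of the loop body.
// Names and exact shape are illustrative, not the code from the godbolt link.
void copy_idx(void *const destination, const void *const source, const size_t bytes)
{
    char *dst            = destination;
    char *const dst_endp = dst + bytes;
    const ptrdiff_t src_minus_dst = (const char *)source - (const char *)destination;

    while (dst < dst_endp)  {
        const __m256i *src = (const __m256i *)(dst + src_minus_dst);
        __m256i m0 = _mm256_load_si256( src + 0 );
        __m256i m1 = _mm256_load_si256( src + 1 );
        __m256i m2 = _mm256_load_si256( src + 2 );
        __m256i m3 = _mm256_load_si256( src + 3 );

        _mm256_stream_si256( (__m256i *)dst + 0, m0 );
        _mm256_stream_si256( (__m256i *)dst + 1, m1 );
        _mm256_stream_si256( (__m256i *)dst + 2, m2 );
        _mm256_stream_si256( (__m256i *)dst + 3, m3 );

        dst += 4 * sizeof(__m256i);   // 128 bytes per iteration
    }
}

(Subtracting pointers into two different buffers isn't strictly portable C, but it expresses the register setup described above: one incrementing pointer plus a constant src-dst offset.)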