On Sun, Nov 17, 2013 at 04:42:18PM +0100, Richard Biener wrote:
> "Ondřej Bílka" <nel...@seznam.cz> wrote:
> >On Sat, Nov 16, 2013 at 11:37:36AM +0100, Richard Biener wrote:
> >> "Ondřej Bílka" <nel...@seznam.cz> wrote:
> >> >On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
> >>
> >> IIRC what can still be seen is store-buffer related slowdowns when
> >> you have a big unaligned store/load in your loop. Thus aligning stores
> >> still pays back, last time I measured this.
> >
> >Then send your benchmark. What I did is a loop that stores 512 bytes.
> >Unaligned stores there are faster than aligned ones, so tell me when
> >aligning stores pays for itself. Note that when filling the store buffer
> >you must take into account the extra stores needed to make the loop
> >aligned.
>
> The issue is that the effective write bandwidth can be limited by the store
> buffer if you have multiple write streams. IIRC at least some AMD CPUs have
> to use two entries for stores crossing a cache line boundary.
>
Performance can just as well be limited by branch misprediction. You need to
show that the likely bottleneck really is too many writes and not some other
factor.

> Anyway, a look into the optimization manuals will tell you what to do and the
> cost model should follow these recommendations.
>
These tend to be quite out of date; you typically need to recheck everything.
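The 512-byte store loop I mentioned above was roughly of the following shape.
This is a simplified sketch rather than the exact code I ran; BUF_SIZE, ITERS
and store_block are illustrative names and values:

#define _GNU_SOURCE
#include <emmintrin.h>
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_SIZE 512         /* bytes stored per call */
#define ITERS    100000000L  /* illustrative iteration count */

static void store_block(char *dst)
{
  /* Thirty-two 16-byte SSE stores, 512 bytes in total.  */
  __m128i v = _mm_set1_epi8(42);
  int i;
  for (i = 0; i < BUF_SIZE; i += 16)
    _mm_storeu_si128((__m128i *) (dst + i), v);
}

int main(void)
{
  /* pvalloc returns page-aligned memory; the + 1 below makes every
     16-byte store unaligned.  Drop it to time the aligned case.  */
  char *buf = pvalloc(BUF_SIZE + 64);
  struct timespec t0, t1;
  long i;

  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (i = 0; i < ITERS; i++)
    store_block(buf + 1);
  clock_gettime(CLOCK_MONOTONIC, &t1);

  printf("%f s\n", (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec));
  return buf[0];  /* read the data so the stores are not optimized away */
}

Timing it once as written and once without the '+ 1' gives the unaligned
versus aligned store comparison.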
Take the Intel® 64 and IA-32 Architectures Optimization Reference Manual from
April 2012. A suggestion on store-to-load forwarding there is to align loads
and stores to make it work (with Pentium 4 and Core 2 specific advice).
However, this has not been true since Nehalem. When I test a variant of
memcpy that is unaligned by one byte, the code is the following (full
benchmark attached):

set:
.LFB0:
        .cfi_startproc
        xor     %rdx, %rdx
        addq    $1, %rsi
        lea     144(%rsi), %rdi
.L:
        movdqu  0(%rsi,%rdx), %xmm0
        movdqu  16(%rsi,%rdx), %xmm1
...
        movdqu  112(%rsi,%rdx), %xmm7
        movdqu  %xmm0, 0(%rdi,%rdx)
...
        movdqu  %xmm7, 112(%rdi,%rdx)
        addq    $128, %rdx
        cmp     $5120, %rdx
        jle     .L
        ret

Then there is only around a 10% slowdown versus the non-forwarding case:

real    0m2.098s
user    0m2.083s
sys     0m0.003s

However, when in 'lea 144(%rsi), %rdi' I use 143 or another non-multiple of
16, performance degrades:

real    0m3.495s
user    0m3.480s
sys     0m0.000s

And the other suggestions are similarly flimsy unless your target is a
Pentium 4.

> >Also what do you do with loops that contain no store? If I modify the test
> >to
> >
> >int set(int *p, int *q){
> >  int i;
> >  int sum = 0;
> >  for (i=0; i < 128; i++)
> >     sum += 42 * p[i];
> >  return sum;
> >}
> >
> >then it still does aligning.
>
> Because the cost model simply does not exist for the decision whether to peel
> or not. Patches welcome.
>
> >There may be a threshold after which aligning the buffer makes sense; then
> >you need to show that the loop spends most of its time on sizes above that
> >threshold.
> >
> >Also, do you have data on how common store-buffer slowdowns are? Without
> >knowing that, you risk making a few loops faster at the expense of the
> >majority, which could well slow the whole application down. It would not
> >surprise me, as these loops may run mostly on L1 cache data (which is about
> >the same level of assumption as that the increased code size fits into the
> >instruction cache).
> >
> >Actually these questions could be answered by a test: first compile
> >SPEC2006 with vanilla gcc -O3 and then with a gcc that contains a patch to
> >use unaligned loads. The results will tell whether peeling is also good in
> >practice or not.
>
> It should not be an on or off decision but rather a decision based on a cost
> model.
>
You cannot decide that on a cost model alone, as performance is determined by
the runtime usage pattern. With profile feedback you could do that.
Alternatively, you can add a runtime branch that enables peeling only above a
preset threshold; a rough sketch of what I mean follows, before the attached
benchmark.
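Roughly, by a runtime branch I mean something along these lines; the name
fill42 and the threshold of 64 elements are just for illustration, not
measured values or code taken from GCC:

#include <stddef.h>
#include <stdint.h>

#define PEEL_THRESHOLD 64

void fill42(int *p, size_t n)
{
  size_t i = 0;

  if (n >= PEEL_THRESHOLD)
    {
      /* Only pay for the alignment prologue when the trip count is large
         enough to amortize it: peel until p is 16-byte aligned so the
         vectorized body could use aligned stores.  */
      while (i < n && ((uintptr_t) (p + i) & 15) != 0)
        {
          p[i] = 42;
          i++;
        }
    }

  /* Main body: for small n this runs directly with unaligned accesses and
     skips the peeling cost entirely (a vectorizer would emit movdqu here,
     movdqa once peeling has run).  */
  for (; i < n; i++)
    p[i] = 42;
}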
/* Benchmark driver.  SIZE is expected to be defined at compile time
   (e.g. via -DSIZE=...); set() is the copy loop from the attached
   assembly below.  */
#define _GNU_SOURCE
#include <stdlib.h>
#include <malloc.h>

void set(char *, char *);

int main(){
  char *ptr = pvalloc(2 * SIZE + 10000);
  char *ptr2 = pvalloc(2 * SIZE + 10000);
  unsigned long p = 31;
  unsigned long q = 17;
  int i;
  for (i=0; i < 8000000; i++) {
    set(ptr + 64 * (p % (SIZE / 64)), ptr2 + 64 * (q % (SIZE / 64)));
    p = 11 * p + 3;
    q = 13 * p + 5;
  }
  return 0;
}
.file "set1.c" .text .p2align 4,,15 .globl set .type set, @function set: .LFB0: .cfi_startproc xor %rdx, %rdx addq $1, %rsi lea 144(%rsi), %rdi .L: movdqu 0(%rsi,%rdx), %xmm0 movdqu 16(%rsi,%rdx), %xmm1 movdqu 32(%rsi,%rdx), %xmm2 movdqu 48(%rsi,%rdx), %xmm3 movdqu 64(%rsi,%rdx), %xmm4 movdqu 80(%rsi,%rdx), %xmm5 movdqu 96(%rsi,%rdx), %xmm6 movdqu 112(%rsi,%rdx), %xmm7 movdqu %xmm0, 0(%rdi,%rdx) movdqu %xmm1, 16(%rdi,%rdx) movdqu %xmm2, 32(%rdi,%rdx) movdqu %xmm3, 48(%rdi,%rdx) movdqu %xmm4, 64(%rdi,%rdx) movdqu %xmm5, 80(%rdi,%rdx) movdqu %xmm6, 96(%rdi,%rdx) movdqu %xmm7, 112(%rdi,%rdx) addq $128, %rdx cmp $5120, %rdx jle .L ret .cfi_endproc .LFE0: .size set, .-set .ident "GCC: (Debian 4.8.1-10) 4.8.1" .section .note.GNU-stack,"",@progbits