On Sun, Nov 17, 2013 at 04:42:18PM +0100, Richard Biener wrote:
> "Ondřej Bílka" <nel...@seznam.cz> wrote:
> >On Sat, Nov 16, 2013 at 11:37:36AM +0100, Richard Biener wrote:
> >> "Ondřej Bílka" <nel...@seznam.cz> wrote:
> >> >On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
> >> 
> >> IIRC what can still be seen is store-buffer related slowdowns when
> >> you have a big unaligned store load in your loop.  Thus aligning stores
> >> still pays back last time I measured this.
> >
> >Then send your benchmark. What I did is a loop that stores 512 bytes.
> >Unaligned stores there are faster than aligned ones, so tell me when
> >aligning stores pays off. Note that in filling the store buffer you must
> >take into account the extra stores needed to make the loop aligned.
> 
> The issue is that the effective write bandwidth can be limited by the store 
> buffer if you have multiple write streams.  IIRC at least some amd CPUs have 
> to use two entries for stores crossing a cache line boundary.
>
Performance can just as well be limited by branch misprediction. You need to
show that the likely bottleneck really is too many writes and not some other
factor.
 
> Anyway, a look into the optimization manuals will tell you what to do and the 
> cost model should follow these recommendations.
> 
These tend to be quite out of date; you typically need to recheck
everything.

Take the
Intel® 64 and IA-32 Architectures Optimization Reference Manual
from April 2012.

A suggestion on store-to-load forwarding there is to align loads and stores
to make it work (with Pentium 4 and Core 2 specific suggestions).

However this has been false since Nehalem. When I test a variant of memcpy
that is unaligned by one byte, the code is the following (full benchmark attached):

        set:
.LFB0:
        .cfi_startproc
        xor     %rdx, %rdx
        addq    $1, %rsi
        lea     144(%rsi), %rdi  
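        # %rsi is now misaligned by one byte and %rdi = %rsi + 144.  Since 144
        # is a multiple of 16, any load below that reads data written by the
        # previous iteration lines up exactly with the corresponding 16-byte
        # store, so store-to-load forwarding can work despite the misalignment;
        # with 143 the loads straddle the stores and forwarding fails.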
.L:
        movdqu  0(%rsi,%rdx), %xmm0
        movdqu  16(%rsi,%rdx), %xmm1
        ...
        movdqu  112(%rsi,%rdx), %xmm7
        movdqu  %xmm0, 0(%rdi,%rdx)
        ...
        movdqu  %xmm7, 112(%rdi,%rdx)
        addq    $128, %rdx
        cmp     $5120, %rdx
        jle     .L
        ret

Then there is only around a 10% slowdown versus the non-forwarding one:

real    0m2.098s
user    0m2.083s
sys     0m0.003s

However, when in 'lea 144(%rsi), %rdi' I use 143 or another non-multiple of 16,
performance degrades:

real    0m3.495s
user    0m3.480s
sys     0m0.000s

And the other suggestions are similarly flimsy unless your target is a Pentium 4.
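For completeness, roughly the same overlapping-copy kernel can be written in C
with SSE intrinsics. This is only a sketch of the access pattern: the function
name copy_kernel and the dst_offset parameter are mine, and gcc will not
necessarily emit exactly the attached assembly.

#include <emmintrin.h>   /* SSE2: _mm_loadu_si128 / _mm_storeu_si128 */

/* Copy 41 blocks of 128 bytes from src+1 to src+1+dst_offset using
   unaligned 16-byte loads and stores, mirroring the attached asm loop.
   With dst_offset = 144 every load that reads previously stored data
   coincides exactly with one 16-byte store (forwarding can work); with
   143 every such load straddles two stores.  */
void copy_kernel (char *src, long dst_offset)
{
  char *s = src + 1;          /* misalign by one byte, like addq $1, %rsi  */
  char *d = s + dst_offset;   /* like lea 144(%rsi), %rdi                  */
  long i;
  for (i = 0; i <= 5120; i += 128)
    {
      __m128i v[8];
      int j;
      for (j = 0; j < 8; j++)     /* eight movdqu loads  */
        v[j] = _mm_loadu_si128 ((const __m128i *) (s + i + 16 * j));
      for (j = 0; j < 8; j++)     /* eight movdqu stores */
        _mm_storeu_si128 ((__m128i *) (d + i + 16 * j), v[j]);
    }
}

Driving this with offsets 144 and 143 should show the same effect as the asm
version above, modulo whatever gcc does to the loop.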

> >Also what do you do with loops that contain no store? If I modify test
> >to
> >
> >int set(int *p, int *q){
> >  int i;
> >  int sum = 0;
> >  for (i=0; i < 128; i++)
> >     sum += 42 * p[i];
> >  return sum;
> >}
> >
> >then it still does aligning.
> 
> Because the cost model simply does not exist for the decision whether to peel 
> or not. Patches welcome.
> 
> >There may be a threshold after which aligning the buffer makes sense; then
> >you need to show that loops spend most of their time on sizes above that
> >threshold.
> >
> >Also, do you have data on how common store-buffer slowdowns are? Without
> >knowing that, you risk making a few loops faster at the expense of the
> >majority, which could well slow the whole application down. It would not
> >surprise me, as these loops may run mostly on L1 cache data (which is
> >about on the same level as assuming that the increased code size fits into
> >the instruction cache).
> >
> >
> >Actually these questions could be answered by a test, first compile
> >SPEC2006 with vanilla gcc -O3 and then with gcc that contains patch to
> >use unaligned loads. Then results will tell if peeling is also good in
> >practice or not.
> 
> It should not be a on or off decision but rather a decision based on a cost 
> model.
> 
You cannot decide that on a cost model alone, as performance is determined by
the runtime usage pattern. If you do profiling then you could do that.
Alternatively you can add a runtime branch that enables peeling only above a
preset threshold.
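A minimal sketch of that idea at the source level (the name set_guarded and
the PEEL_THRESHOLD value are made up for illustration; the real decision would
of course sit in the vectorizer's generated guard code, not in user source):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff: below this trip count the extra prologue stores
   needed to reach an aligned pointer are assumed not to pay off.  */
#define PEEL_THRESHOLD 256

/* Fill n ints with 42; p is assumed to be at least 4-byte aligned.  */
void set_guarded (int *p, size_t n)
{
  size_t i = 0;

  if (n >= PEEL_THRESHOLD && ((uintptr_t) p % 16) != 0)
    {
      /* Long enough to amortize peeling: do scalar iterations until p+i
         is 16-byte aligned, so the main loop can use aligned stores.  */
      size_t peel = (16 - (uintptr_t) p % 16) / sizeof (int);
      for (; i < peel; i++)
        p[i] = 42;
    }

  /* Main loop: aligned if we peeled, unaligned (but short) otherwise.  */
  for (; i < n; i++)
    p[i] = 42;
}

With profile feedback the branch (or the threshold) could be resolved at
compile time, which is the profiling case mentioned above.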

/* Attached benchmark driver.  SIZE is expected to be supplied at compile
   time, e.g. via -DSIZE=...  */
#define _GNU_SOURCE
#include <stdlib.h>
#include <malloc.h>

/* Defined in the attached assembly below.  */
void set (char *, char *);

int main(){
   char *ptr = pvalloc(2 * SIZE + 10000);
   char *ptr2 = pvalloc(2 * SIZE + 10000);

   /* Simple linear congruential sequences to vary the 64-byte-aligned
      offsets passed to set().  */
   unsigned long p = 31;
   unsigned long q = 17;

   int i;
   for (i=0; i < 8000000; i++) {
     set (ptr + 64 * (p % (SIZE / 64)), ptr2 + 64 * (q % (SIZE / 64)));
     p = 11 * p + 3;
     q = 13 * p + 5;
   }
   return 0;
}
        .file   "set1.c"
        .text
        .p2align 4,,15
        .globl  set
        .type   set, @function
set:
.LFB0:
        .cfi_startproc
        xor     %rdx, %rdx
        addq    $1, %rsi
        lea     144(%rsi), %rdi  
.L:
        movdqu  0(%rsi,%rdx), %xmm0
        movdqu  16(%rsi,%rdx), %xmm1
        movdqu  32(%rsi,%rdx), %xmm2
        movdqu  48(%rsi,%rdx), %xmm3
        movdqu  64(%rsi,%rdx), %xmm4
        movdqu  80(%rsi,%rdx), %xmm5
        movdqu  96(%rsi,%rdx), %xmm6
        movdqu  112(%rsi,%rdx), %xmm7
        movdqu  %xmm0, 0(%rdi,%rdx)
        movdqu  %xmm1, 16(%rdi,%rdx)
        movdqu  %xmm2, 32(%rdi,%rdx)
        movdqu  %xmm3, 48(%rdi,%rdx)
        movdqu  %xmm4, 64(%rdi,%rdx)
        movdqu  %xmm5, 80(%rdi,%rdx)
        movdqu  %xmm6, 96(%rdi,%rdx)
        movdqu  %xmm7, 112(%rdi,%rdx)
        addq    $128, %rdx
        cmp     $5120, %rdx
        jle     .L
        ret
        .cfi_endproc
.LFE0:
        .size   set, .-set
        .ident  "GCC: (Debian 4.8.1-10) 4.8.1"
        .section        .note.GNU-stack,"",@progbits
