On Thu, Aug 22, 2013 at 1:24 AM, Ondřej Bílka <nel...@seznam.cz> wrote:
> On Wed, Aug 21, 2013 at 11:50:34PM -0700, Xinliang David Li wrote:
>> > The effect on runtime is not correlated to
>> > either (which means the vectorizer cost model is rather bad), but integer
>> > code usually does not benefit at all.
>>
>> The cost model does need some tuning. For instance, the GCC vectorizer
>> peels aggressively, but in many cases peeling can be avoided while
>> still getting good performance -- even when the target does not have
>> efficient unaligned load/store instructions to implement unaligned
>> access. GCC reports too high a cost for unaligned accesses and too low
>> a cost for peeling overhead.
>>
> Another issue is that gcc generates very inefficient loop headers. If I
> change the example with the following line
>
> foo(a+rand()%10000, b+rand()%10000, c+rand()%10000, rand()%64);
>
> then I get a vectorizer regression comparing
> gcc-4.7 -O3 x.c -o xa
> versus
> gcc-4.7 -O2 -funroll-all-loops x.c -o xb
>
>> Example:
>>
>> #ifndef TYPE
>> #define TYPE float
>> #endif
>> #include <stdlib.h>
>>
>> __attribute__((noinline)) void
>> foo (TYPE *a, TYPE* b, TYPE *c, int n)
>> {
>>    int i;
>>    for ( i = 0; i < n; i++)
>>      a[i] = b[i] * c[i];
>> }
>>
>> int g;
>> int
>> main()
>> {
>>    int i;
>>    float *a = (float*) malloc (100000*4);
>>    float *b = (float*) malloc (100000*4);
>>    float *c = (float*) malloc (100000*4);
>>
>>    for (i = 0; i < 100000; i++)
>>       foo(a, b, c, 100000);
>>
>>
>>    g = a[10];
>>
>> }
>>


Good test.  I also changed the test case to force the start addresses
to be misaligned, by calling foo(a+1, b+1, c+1, 10000): 3)'s
performance drops from 1.5s to 2.5s, but that is still much better
than 2) and 4).  One correction -- plain -O2 is the slowest; its
runtime is about 5.4s.  (All tests use the trunk compiler with -O2
-ftree-vectorize.)

I tried your case with the trunk compiler (-O2 -ftree-vectorize); the
runtimes on a Westmere machine:

1) -march=corei7 : 2.1s
2) -march=x86-64: 4.8s
3) NOPEEL + -march=corei7 : 2.2s
4) NOPEEL + -march=x86-64: 4.8s
5) -O2  : 5.5s
6) -O3 -funroll-all-loops -march=corei7 : 2.2s
7) -O3 -funroll-all-loops -march=x86-64: 4.3s
8) -O2 -funroll-all-loops : 4.6s


With random start-address alignment, 3) is very close to 1) in
practice, so it is the best choice.
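For reference, a self-contained variant of the benchmark with randomized start offsets, along the lines Ondřej suggested, might look like the sketch below. This is a reconstruction, not the exact test run: the `run_random_offsets` name and the padding amounts are my own choices.

```c
#include <stdlib.h>

__attribute__((noinline)) void
foo (float *a, float *b, float *c, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = b[i] * c[i];
}

/* Hypothetical driver: random start offsets defeat alignment
   peeling, since the peel prologue length changes on every call
   and never amortizes. */
float
run_random_offsets (void)
{
  /* Pad allocations by the maximum offset so the shifted accesses
     stay in bounds. */
  float *a = calloc (100000 + 10000, sizeof (float));
  float *b = calloc (100000 + 10000, sizeof (float));
  float *c = calloc (100000 + 10000, sizeof (float));

  for (int i = 0; i < 100000; i++)
    foo (a + rand () % 10000, b + rand () % 10000,
         c + rand () % 10000, rand () % 64);

  float g = a[10];
  free (a); free (b); free (c);
  return g;
}
```

Because every call sees a different alignment of all three pointers, a no-peel strategy with plain unaligned loads (configuration 3 above) should track the peeled version closely here.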


>>
>> 1) By default, GCC's vectorizer will peel the loop in foo so that the
>> access to 'a' is aligned and uses the movaps instruction. The other
>> accesses use movups when -march=corei7 is given.
>> 2) Same as above, but with -march=x86-64. The access to 'b' is split
>> into movlps and movhps; same for 'c'.
>>
>> 3) Peeling disabled (via a hack), with -march=corei7 -- all three
>> accesses use movups.
>> 4) Peeling disabled, with -march=x86-64 -- all three accesses use
>> movlps/movhps.
>>
>> Performance:
>>
>> 1) and 3) -- both 1.58s, but 3) is much smaller than 1): 3)'s text is
>> 1462 bytes, while 1)'s is 1622 bytes.
>> 2) and 4) and no vectorization -- all very slow -- 4.8s
>>
> This could be explained by lack of unrolling. When unrolling is
> enabled, the slowdown is only 20% over the SSE variant.

Standalone unroller tuning is an orthogonal issue here.  Note that we
are shooting for the best possible vectorizer performance (to be
turned on at -O2) under strict size and compile-time-increase
constraints.
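The three load strategies discussed above (movaps after peeling, movups, and the movlps/movhps pair) correspond roughly to the following SSE intrinsics. This is an illustrative sketch of one 4-float load, with made-up function names; the actual compiler output of course differs in loop structure and scheduling:

```c
#include <xmmintrin.h>

/* movaps: the pointer must be 16-byte aligned -- this is what
   alignment peeling buys for 'a'. */
__m128 load_aligned (const float *p)
{
  return _mm_load_ps (p);
}

/* movups: a single unaligned load, cheap on corei7-class hardware. */
__m128 load_unaligned (const float *p)
{
  return _mm_loadu_ps (p);
}

/* movlps + movhps: the split low/high load pair that generic x86-64
   tuning falls back to -- two loads per vector. */
__m128 load_split (const float *p)
{
  __m128 v = _mm_setzero_ps ();
  v = _mm_loadl_pi (v, (const __m64 *) p);       /* low 8 bytes  */
  v = _mm_loadh_pi (v, (const __m64 *) (p + 2)); /* high 8 bytes */
  return v;
}
```

The cost-model question in this thread is essentially how expensive `load_unaligned` and `load_split` are relative to peeling into `load_aligned` on a given -march.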

>
>> > That said, I expect 99% of used software (probably rather 99.99999%)
>> > is not compiled on the system it runs on, but compiled to run on
>> > generic hardware, and thus restricts itself to bare x86_64 SSE2
>> > features.  So what matters for enabling the vectorizer at -O2 is the
>> > default architecture features of the given architecture(!) -- and
>> > remember to not only consider x86 here!
>> >
> This is a non-issue, as SSE2 already contains most of the operations
> needed. The performance improvement from the later SSE extensions is
> minimal.
>
> A performance improvement over SSE2 could come from AVX/AVX2, but the
> vectorizer's AVX support is still severely lacking.
>
>> > The same argument was made about the fact that GCC does not optimize
>> > by default but uses -O0.  It's a straw-man argument.  All
>> > "benchmarking" I see uses -O3 or -Ofast already.
>>
>> People can just do -O2 performance comparison.
>>
> When machines spend 95% of their time in code compiled with gcc -O2,
> then benchmarking should be done at -O2.
> With any other flags you just get a bunch of numbers that bear little
> relation to real-world performance.

Yes.


thanks,

David
