On Wed, Aug 21, 2013 at 11:50:34PM -0700, Xinliang David Li wrote:
> > The effect on runtime is not correlated to
> > either (which means the vectorizer cost model is rather bad), but integer
> > code usually does not benefit at all.
>
> The cost model does need some tuning. For instance, the GCC vectorizer
> does peeling aggressively, but peeling in many cases can be avoided
> while still gaining good performance -- even when the target does not
> have efficient unaligned load/store to implement unaligned access. GCC
> reports too high a cost for unaligned access and too low a cost for
> peeling overhead.

Another issue is that gcc generates very inefficient loop headers. If I
change the example so that foo is called with the following line
  foo(a+rand()%10000, b+rand()%10000, c+rand()%10000, rand()%64);

then I get a vectorization regression of

  gcc-4.7 -O3 x.c -o xa

versus

  gcc-4.7 -O2 -funroll-all-loops x.c -o xb

> Example:
>
> #ifndef TYPE
> #define TYPE float
> #endif
> #include <stdlib.h>
>
> __attribute__((noinline)) void
> foo (TYPE *a, TYPE *b, TYPE *c, int n)
> {
>   int i;
>   for (i = 0; i < n; i++)
>     a[i] = b[i] * c[i];
> }
>
> int g;
> int
> main ()
> {
>   int i;
>   float *a = (float *) malloc (100000 * 4);
>   float *b = (float *) malloc (100000 * 4);
>   float *c = (float *) malloc (100000 * 4);
>
>   for (i = 0; i < 100000; i++)
>     foo (a, b, c, 100000);
>
>   g = a[10];
> }
>
> 1) By default, GCC's vectorizer will peel the loop in foo so that the
> access to 'a' is aligned and uses the movaps instruction. The other
> accesses use movups when -march=corei7 is used.
> 2) Same as above, but with -march=x86-64. The access to 'b' is split
> into movlps and movhps; same for 'c'.
> 3) Disabling peeling (via a hack) with -march=corei7 -- all three
> accesses use movups.
> 4) Disabling peeling, with -march=x86-64 -- all three accesses use
> movlps/movhps.
>
> Performance:
>
> 1) and 3) -- both 1.58s, but 3) is much smaller than 1): 3)'s text is
> 1462 bytes, and 1)'s is 1622 bytes.
> 3), 4), and no vectorization -- all very slow -- 4.8s.

This could be explained by the lack of unrolling. When unrolling is
enabled, the slowdown is only 20% over the SSE variant.

> > That said, I expect 99% of used software
> > (probably rather 99.99999%) is not compiled on the system it runs on
> > but compiled to run on generic hardware and thus restricts itself to
> > bare x86_64 SSE2 features. So what matters for enabling the vectorizer
> > at -O2 is the default architecture features of the given
> > architecture(!) -- remember to not only consider x86 here!
>
> This is a non-issue, as SSE2 already contains most of the operations
> needed. The performance improvement from the additional ss* extensions
> is minimal.
A performance improvement over SSE2 could come from AVX/AVX2, but the
vectorizer's AVX support is still severely lacking.

> > The same argument was made about the fact that GCC does not optimize
> > by default but uses -O0. It's a straw-man argument. All "benchmarking"
> > I see uses -O3 or -Ofast already.
>
> People can just do an -O2 performance comparison.

When machines spend 95% of their time in code compiled by gcc -O2, then
benchmarking should be done at -O2. With any other flags you will just
get a bunch of numbers that are not very related to performance.