On Wed, Aug 21, 2013 at 11:50:34PM -0700, Xinliang David Li wrote:
> > The effect on runtime is not correlated to
> > either (which means the vectorizer cost model is rather bad), but integer
> > code usually does not benefit at all.
>
> The cost model does need some tuning. For instance, the GCC vectorizer
> does peeling aggressively, but peeling in many cases can be avoided
> while still gaining good performance -- even when the target does not
> have efficient unaligned load/store to implement unaligned access. GCC
> reports too high a cost for unaligned access and too low a cost for
> peeling overhead.

Another issue is that gcc generates very inefficient loop headers. If I
change the example so that foo is called with the following line
  foo(a+rand()%10000, b+rand()%10000, c+rand()%10000, rand()%64);

then I get a vectorization regression of

  gcc-4.7 -O3 x.c -o xa

versus

  gcc-4.7 -O2 -funroll-all-loops x.c -o xb

> Example:
>
> #ifndef TYPE
> #define TYPE float
> #endif
> #include <stdlib.h>
>
> __attribute__((noinline)) void
> foo (TYPE *a, TYPE *b, TYPE *c, int n)
> {
>   int i;
>   for (i = 0; i < n; i++)
>     a[i] = b[i] * c[i];
> }
>
> int g;
> int
> main ()
> {
>   int i;
>   float *a = (float *) malloc (100000 * 4);
>   float *b = (float *) malloc (100000 * 4);
>   float *c = (float *) malloc (100000 * 4);
>
>   for (i = 0; i < 100000; i++)
>     foo (a, b, c, 100000);
>
>   g = a[10];
> }
>
> 1) By default, GCC's vectorizer will peel the loop in foo so that the
> access to 'a' is aligned and uses the movaps instruction. The other
> accesses use movups when -march=corei7 is used.
> 2) Same as above, but with -march=x86-64. The access to 'b' is split
> into movlps and movhps; same for 'c'.
> 3) Disabling peeling (via a hack) with -march=corei7 -- all three
> accesses use movups.
> 4) Disabling peeling, with -march=x86-64 -- all three accesses use
> movlps/movhps.
>
> Performance:
>
> 1) and 3) -- both 1.58s, but 3) is much smaller than 1): 3)'s text is
> 1462 bytes, and 1)'s is 1622 bytes.
> 3), 4), and no vectorization -- all very slow -- 4.8s.

This could be explained by the lack of unrolling. When unrolling is
enabled, the slowdown is only 20% over the SSE variant.

> > That said, I expect 99% of used software
> > (probably rather 99.99999%) is not compiled on the system it runs on
> > but compiled to run on generic hardware and thus restricts itself to
> > bare x86_64 SSE2 features. So what matters for enabling the vectorizer
> > at -O2 is the default architecture features of the given
> > architecture(!) -- remember to not only consider x86 here!
>
> This is a non-issue, as SSE2 already contains most of the operations
> needed. The performance improvement from the additional ss* extensions
> is minimal.
A performance improvement over SSE2 could come from AVX/AVX2, but the
vectorizer's AVX support is still severely lacking.

> > The same argument was made about the fact that GCC does not optimize
> > by default but uses -O0. It's a straw-man argument. All "benchmarking"
> > I see uses -O3 or -Ofast already.
>
> People can just do an -O2 performance comparison.

When machines spend 95% of their time in code compiled by gcc -O2, then
benchmarking should be done at -O2. With any other flags you will just
get a bunch of numbers that are not very related to performance.