On Thu, Aug 22, 2013 at 8:50 AM, Xinliang David Li <davi...@google.com> wrote:
>> The effect on runtime is not correlated to either (which means the
>> vectorizer cost model is rather bad), but integer code usually does
>> not benefit at all.
>
> The cost model does need some tuning. For instance, the GCC vectorizer
> peels aggressively, but in many cases peeling can be avoided while
> still getting good performance -- even when the target does not have
> efficient unaligned loads/stores with which to implement unaligned
> accesses. GCC reports too high a cost for unaligned accesses and too
> low a cost for the peeling overhead.
>
> Example:
>
> #ifndef TYPE
> #define TYPE float
> #endif
> #include <stdlib.h>
>
> __attribute__((noinline)) void
> foo (TYPE *a, TYPE *b, TYPE *c, int n)
> {
>   int i;
>   for (i = 0; i < n; i++)
>     a[i] = b[i] * c[i];
> }
>
> int g;
> int
> main ()
> {
>   int i;
>   float *a = (float *) malloc (100000 * 4);
>   float *b = (float *) malloc (100000 * 4);
>   float *c = (float *) malloc (100000 * 4);
>
>   for (i = 0; i < 100000; i++)
>     foo (a, b, c, 100000);
>
>   g = a[10];
> }
>
> 1) By default, GCC's vectorizer peels the loop in foo so that the
> access to 'a' is aligned and uses the movaps instruction; the other
> accesses use movups when -march=corei7 is used.
> 2) Same as above, but with -march=x86-64: the access to 'b' is split
> into movlps and movhps, and likewise for 'c'.
> 3) Disabling peeling (via a hack) with -march=corei7: all three
> accesses use movups.
> 4) Disabling peeling with -march=x86-64: all three accesses use
> movlps/movhps.
>
> Performance:
>
> 1) and 3) both run in 1.58s, but 3) is much smaller than 1): 3)'s text
> is 1462 bytes while 1)'s is 1622 bytes.
> 2) and 4) and the non-vectorized version are all very slow -- 4.8s.
>
> Observations:
> a) If properly tuned for corei7, GCC should pick 3) instead of 1) --
> this is not possible today.
> b) With -march=x86-64, GCC should figure out that the benefit of
> vectorizing this loop is small and bail out.
>
>>> On the other hand, a 10% compile-time increase due to one pass
>>> sounds excessive -- there might be some low-hanging fruit there to
>>> reduce the compile-time increase.
>>
>> I have already spent two man-months speeding up the vectorizer
>> itself; I don't think there is any low-hanging fruit left there. But
>> see above -- most of the compile time is due to the cost of
>> processing the extra loop copies.
>
> Ok.
>
> I did not notice your patch (from May this year) until recently. Do
> you plan to check it in (other than the part that turns it on at O2)?
> The cost model part of the changes is largely independent. If it is
> in, it will serve as a good basis for further tuning.
>
>>>> At the full feature set, vectorization regresses the runtime of
>>>> quite a number of benchmarks significantly. At a reduced feature
>>>> set -- basically trying to vectorize only the obviously profitable
>>>> cases -- these regressions can be avoided, but progressions remain
>>>> on only two SPEC fp cases. As most user applications fall into the
>>>> SPEC int category, a 10% compile-time and 15% code-size regression
>>>> for no gain is no good.
>>>
>>> Cong's data (especially corei7 and corei7avx) shows more significant
>>> performance improvement. If the 10% compile-time increase is across
>>> the board and happens on benchmarks with no performance improvement,
>>> it is certainly bad -- but I am not sure that is the case.
>>
>> Note that we are talking about -O2 -- people that enable
>> -march=corei7 usually know to use -O3 or FDO anyway.
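To make David's peeling example above concrete: variant 1) corresponds
roughly to the hand-written SSE sketch below. This is illustrative only,
not actual GCC output, and foo_peeled is a made-up name. A scalar
prologue peels iterations until 'a' is 16-byte aligned, the main loop
then stores to 'a' with an aligned movaps while 'b' and 'c' are still
loaded with movups, and a scalar epilogue handles the tail. Variant 3)
is the same main loop with _mm_storeu_ps and no alignment prologue,
which is where the text-size difference David measured comes from.

/* Hand-written sketch of variant 1) -- peeling for alignment.  */
#include <stdint.h>
#include <xmmintrin.h>

void
foo_peeled (float *a, float *b, float *c, int n)
{
  int i = 0;

  /* Prologue: peel scalar iterations until 'a' is 16-byte aligned.  */
  while (i < n && ((uintptr_t) (a + i) & 15) != 0)
    {
      a[i] = b[i] * c[i];
      i++;
    }

  /* Main vector loop: 4 floats per iteration.  The store to 'a' is
     aligned (movaps); the loads from 'b' and 'c' stay unaligned
     (movups), since peeling can align at most one access.  */
  for (; i + 4 <= n; i += 4)
    {
      __m128 vb = _mm_loadu_ps (b + i);
      __m128 vc = _mm_loadu_ps (c + i);
      _mm_store_ps (a + i, _mm_mul_ps (vb, vc));
    }

  /* Epilogue: remaining scalar iterations.  */
  for (; i < n; i++)
    a[i] = b[i] * c[i];
}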
>
> Many people use FDO, but not all -- there are still some barriers to
> adoption. And there are reasons people may not want to use O3:
> 1) People feel most comfortable using O2 because it is considered the
> most thoroughly tested optimization level; going with the default is
> the natural choice. FDO is a different beast, as the performance
> benefit can be too high to resist;
> 2) in a distributed build environment with object-file
> caching/sharing, building with O3 (different from the default) leads
> to longer build times;
> 3) the size/compile-time cost of O3 can be too high. On the other
> hand, the benefit of the vectorizer can be very high for many types
> of applications -- image processing, stitching, image detection, DSP,
> encoders/decoders -- not just numerical Fortran programs.
>
>> That said, I expect 99% of used software (probably rather 99.99999%)
>> is not compiled on the system it runs on but compiled to run on
>> generic hardware, and thus restricts itself to bare x86_64 SSE2
>> features. So what matters for enabling the vectorizer at -O2 is the
>> default architecture features of the given architecture(!) --
>> remember to not only consider x86 here!
>>
>>> A couple of points I'd like to make:
>>>
>>> 1) The loop vectorizer passes the quality threshold to be turned on
>>> by default at O2 in 4.9; it is already turned on for FDO at O2.
>>
>> With FDO we have a _much_ better way of reasoning about which loops
>> to spend the compile time and code size on! That is exactly the
>> problem that exists without FDO at -O2 (and also at -O3, but -O3 is
>> not said to be well-balanced with regard to compile time and code
>> size).
>>
>>> 2) There is still lots of room for improvement in the loop
>>> vectorizer -- there is no doubt about it, and we will need to
>>> continue improving it;
>>
>> I believe we have to do that first. See the patches regarding the
>> cost model reorganization I posted with the proposal for enabling
>> vectorization at -O2. One large source of collateral damage from
>> vectorization is if-conversion, which aggressively if-converts loops
>> regardless of whether we later vectorize the result. The
>> if-conversion pass needs to be integrated with vectorization.
>
> We noticed some small performance problems with tree if-conversion
> when it is turned on with FDO -- because that pass has no cost model
> (such as looking at branch probabilities, as the RTL-level if-cvt
> does). What other problems do you see? Is it just a compile-time
> concern? (The transform is sketched below.)
>
>>> 3) The only fast way to improve a feature is to get it used widely,
>>> so that people can file bugs and report problems -- it is hard for
>>> developers to find and collect all the cases where GCC is weak
>>> without the GCC community's help. There might be a temporary
>>> regression for some users, but it is worth the pain;
>>
>> Well, introducing known regressions at -O2 is not how this works.
>> Vectorization is already widely tested, and you can look at a
>> plethora of bug reports about missed features and vectorizer
>> wrongdoings to improve it.
>>
>>> 4) Not the most important point, but a practical concern: without
>>> turning it on, GCC will be at a great disadvantage when people start
>>> benchmarking the latest GCC against other compilers.
>>
>> The same argument was made about the fact that GCC does not optimize
>> by default but uses -O0. It's a straw-man argument. All
>> "benchmarking" I see uses -O3 or -Ofast already.
>
> People can just do an -O2 performance comparison.
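The sketch David's if-conversion question refers to: roughly, tree
if-conversion rewrites a guarded store into straight-line code so the
vectorizer can turn the branch into compare-and-blend. Both functions
below are hypothetical hand-written illustrations, not GCC output; the
exact shape GCC produces depends on whether it may store speculatively
or has masked stores available.

/* Original loop: the store to a[i] is guarded by a data-dependent
   branch, which blocks straight-line vectorization.  */
void
bar (float *a, const float *b, int n)
{
  int i;
  for (i = 0; i < n; i++)
    if (b[i] > 0.0f)
      a[i] = b[i] * 2.0f;
}

/* After if-conversion (conceptually): the branch becomes a select,
   so every iteration does the multiply plus an unconditional read and
   rewrite of a[i].  This shape vectorizes into compare + blend, but
   when the branch is well-predicted and mostly not taken, the
   branchless scalar form can be slower -- hence the need for a cost
   model that looks at branch probabilities.  */
void
bar_ifcvt (float *a, const float *b, int n)
{
  int i;
  for (i = 0; i < n; i++)
    {
      float v = b[i] * 2.0f;
      a[i] = (b[i] > 0.0f) ? v : a[i];
    }
}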
They can also do an -O performance comparison.

Richard.

> thanks,
>
> David
>
>> To make vectorization have a bigger impact on day-to-day software,
>> GCC would need to start versioning for the target sub-architecture --
>> which of course increases the issue with code size and compile time.
>>
>> Richard.
>>
>>> thanks,
>>>
>>> David
>>>
>>>> Richard.
>>>>
>>>>> thanks,
>>>>>
>>>>> David
>>>>>
>>>>>> Richard.
>>>>>>
>>>>>>>> Vectorization has great performance potential -- the more
>>>>>>>> people use it, the more likely it will be further improved --
>>>>>>>> turning it on at O2 is the way to go ...
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>>
>>>>>>>> Cong Hou
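On Richard's point about versioning for the target sub-architecture: a
minimal sketch of what manual versioning could look like, using GCC's
__builtin_cpu_supports builtin (available since GCC 4.8). The function
names and the choice of AVX vs. SSE2 clones are illustrative. Each
clone duplicates the hot loop, which is exactly the code-size and
compile-time cost Richard alludes to; later GCC releases (6 and up)
automate this pattern with the target_clones attribute.

/* One clone per sub-architecture; with -O3 or -ftree-vectorize each
   clone's loop can be vectorized for its own ISA.  */
__attribute__((target ("avx"))) static void
foo_avx (float *a, const float *b, const float *c, int n)
{
  int i;
  for (i = 0; i < n; i++)   /* may use 256-bit AVX here */
    a[i] = b[i] * c[i];
}

__attribute__((target ("sse2"))) static void
foo_sse2 (float *a, const float *b, const float *c, int n)
{
  int i;
  for (i = 0; i < n; i++)   /* 128-bit SSE2 baseline */
    a[i] = b[i] * c[i];
}

/* Dispatch at runtime based on the CPU actually running the code.  */
void
foo_dispatch (float *a, const float *b, const float *c, int n)
{
  if (__builtin_cpu_supports ("avx"))
    foo_avx (a, b, c, n);
  else
    foo_sse2 (a, b, c, n);
}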