Interesting idea!

David
On Thu, Aug 22, 2013 at 4:46 PM, Cong Hou <co...@google.com> wrote:
> Another opportunity to reduce the code size is combining the scalar version
> from loop versioning with the prolog and the epilog of loop peeling. I
> manually wrote the following function for foo(). The running time does not
> change (for corei7, since I use _mm_loadu_ps()), but the text size (for the
> function only) is reduced from 342 to 240 bytes (41 bytes for the
> non-vectorized version). We can get more benefit if the loop body is larger.
>
>
> void foo2 (TYPE *a, TYPE *b, TYPE *c, int n)
> {
>   int i, m;
>   __m128 veca, vecb, vecc;
>
>   i = 0;
>
>   /* Loop versioning: vectorize only if a cannot overlap b or c.  */
>   if ((b >= a+4 | b+4 <= a) &
>       (c >= a+4 | c+4 <= a))
>     {
>       /* Number of scalar iterations needed to align a to 16 bytes.  */
>       m = (int)((16 - ((unsigned long)a & 15)) >> 2) & 3;
>       if (m > n)
>         m = n;
>       goto L2;
>
> L1:
>       for (; i + 4 <= n; i += 4)
>         {
>           vecb = _mm_loadu_ps (b+i);
>           vecc = _mm_loadu_ps (c+i);
>           veca = _mm_mul_ps (vecb, vecc);
>           _mm_store_ps (a+i, veca);
>         }
>       m = n;  /* The epilog is the scalar loop again.  */
>     }
>   else
>     m = n;    /* Possible aliasing: the scalar loop does everything.  */
>
> L2:
>   for (; i < m; i++)
>     a[i] = b[i] * c[i];
>   if (i < n)
>     goto L1;
> }
>
>
> thanks,
>
> Cong
>
>
> On Wed, Aug 21, 2013 at 11:50 PM, Xinliang David Li <davi...@google.com>
> wrote:
>>
>> > The effect on runtime is not correlated to
>> > either (which means the vectorizer cost model is rather bad), but
>> > integer code usually does not benefit at all.
>>
>> The cost model does need some tuning. For instance, the GCC vectorizer
>> does peeling aggressively, but in many cases peeling can be avoided
>> while still getting good performance -- even when the target does not
>> have efficient unaligned loads/stores to implement unaligned accesses.
>> GCC reports too high a cost for unaligned accesses and too low a cost
>> for the peeling overhead.
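Cong's combined scheme is easier to study without intrinsics. The sketch below is an editor's illustration, not code from the mails: `foo2_model`, the label names, and the peel-count arithmetic are all invented here, and the 4-wide "vector" body is plain scalar code so the example runs on any target. It keeps the same control-flow idea: a single scalar loop serves as the versioning fallback, the alignment prolog, and the epilog.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the combined versioning/prolog/epilog control flow.
   One scalar loop ("scalar:") plays three roles; the 4-wide block
   ("vector:") stands in for the SIMD body.  */
static void
foo2_model (float *a, const float *b, const float *c, int n)
{
  int i = 0, m;

  /* Versioning test: take the wide path only if a cannot overlap b or c
     within one 4-element block (mirrors the check in foo2).  */
  if ((b >= a + 4 || b + 4 <= a) && (c >= a + 4 || c + 4 <= a))
    {
      /* Prolog count: scalar iterations until a is 16-byte aligned
         (editor's formula; assumes a is at least 4-byte aligned).  */
      m = (int) ((16 - ((uintptr_t) a & 15)) >> 2) & 3;
      if (m > n)
        m = n;
      goto scalar;

    vector:
      for (; i + 4 <= n; i += 4)
        {
          a[i]     = b[i]     * c[i];
          a[i + 1] = b[i + 1] * c[i + 1];
          a[i + 2] = b[i + 2] * c[i + 2];
          a[i + 3] = b[i + 3] * c[i + 3];
        }
      m = n;                 /* Epilog: reuse the scalar loop for the tail.  */
    }
  else
    m = n;                   /* Possible aliasing: scalar loop does it all.  */

scalar:
  for (; i < m; i++)
    a[i] = b[i] * c[i];
  if (i < n)
    goto vector;
}
```

The point of the shape is that the fallback, prolog, and epilog paths share one loop body, which is where the text-size saving in Cong's measurement comes from.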
>>
>> Example:
>>
>> #ifndef TYPE
>> #define TYPE float
>> #endif
>> #include <stdlib.h>
>>
>> __attribute__((noinline)) void
>> foo (TYPE *a, TYPE *b, TYPE *c, int n)
>> {
>>   int i;
>>   for (i = 0; i < n; i++)
>>     a[i] = b[i] * c[i];
>> }
>>
>> int g;
>>
>> int
>> main ()
>> {
>>   int i;
>>   float *a = (float *) malloc (100000 * 4);
>>   float *b = (float *) malloc (100000 * 4);
>>   float *c = (float *) malloc (100000 * 4);
>>
>>   for (i = 0; i < 100000; i++)
>>     foo (a, b, c, 100000);
>>
>>   g = a[10];
>> }
>>
>> 1) By default, GCC's vectorizer peels the loop in foo so that the access
>>    to 'a' is aligned and uses movaps; the other accesses use movups when
>>    -march=corei7 is used.
>> 2) Same as above, but with -march=x86-64: the access to 'b' is split
>>    into movlps and movhps, and likewise for 'c'.
>> 3) Disabling peeling (via a hack) with -march=corei7: all three accesses
>>    use movups.
>> 4) Disabling peeling with -march=x86-64: all three accesses use
>>    movlps/movhps.
>>
>> Performance:
>>
>> 1) and 3) -- both 1.58s, but 3) is much smaller than 1): 3)'s text is
>> 1462 bytes while 1)'s is 1622 bytes.
>> 2) and 4) and no vectorization -- all very slow -- 4.8s.
>>
>> Observations:
>> a) If properly tuned for corei7, GCC should pick 3) instead of 1) --
>>    this is not possible today.
>> b) With -march=x86-64, GCC should figure out that the benefit of
>>    vectorizing this loop is small and bail out.
>>
>> >> On the other hand, a 10% compile-time increase due to one pass sounds
>> >> excessive -- there might be some low-hanging fruit to reduce the
>> >> compile-time increase.
>> >
>> > I have already spent two man-months speeding up the vectorizer itself;
>> > I don't think there is any low-hanging fruit left there. But see above
>> > -- most of the compile time is due to the cost of processing the extra
>> > loop copies.
>>
>> Ok.
>>
>> I did not notice your patch (from May this year) until recently.
>> Do you plan to check it in (other than the part that turns it on at
>> O2)? The cost model part of the changes is largely independent. If it
>> is in, it will serve as a good basis for further tuning.
>>
>> >> At the full feature set, vectorization regresses the runtime of
>> >> quite a number of benchmarks significantly. At a reduced feature set
>> >> -- basically trying to vectorize only obviously profitable cases --
>> >> these regressions can be avoided, but progressions remain on only
>> >> two SPEC fp cases. As most user applications fall into the SPEC int
>> >> category, a 10% compile-time and 15% code-size regression for no
>> >> gain is no good.
>> >>
>> >> Cong's data (especially corei7 and corei7avx) shows more significant
>> >> performance improvements. If the 10% compile-time increase is across
>> >> the board and happens on benchmarks with no performance improvement,
>> >> it is certainly bad -- but I am not sure that is the case.
>> >
>> > Note that we are talking about -O2 -- people that enable
>> > -march=corei7 usually know to use -O3 or FDO anyway.
>>
>> Many people use FDO, but not all -- there are still some barriers to
>> adoption. And there are reasons people may not want to use O3:
>> 1) People feel most comfortable using O2 because it is considered the
>>    most thoroughly tested optimization level; going with the default is
>>    the natural choice. FDO is a different beast, as the performance
>>    benefit can be too high to resist.
>> 2) In a distributed build environment with object-file caching/sharing,
>>    building with O3 (different from the default) leads to longer build
>>    times.
>> 3) The size/compile-time cost can be too high with O3. On the other
>>    hand, the benefit of the vectorizer can be very high for many types
>>    of applications such as image processing, stitching, image
>>    detection, DSP, and encoders/decoders -- not only numerical Fortran
>>    programs.
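For reference, option 3) from the measurements above -- no peeling, every access through unaligned loads/stores -- can be written by hand with intrinsics. This is an editor's sketch, not code from the mails: `foo_unaligned` is an invented name, and the `__SSE__` guard with a scalar fallback is added so the example also builds where SSE is unavailable.

```c
#include <assert.h>
#ifdef __SSE__
#include <xmmintrin.h>
#endif

/* Hand-written equivalent of "no peeling, all accesses unaligned":
   on SSE targets every access compiles to movups-style unaligned
   loads/stores; the trailing scalar loop handles the remainder.  */
static void
foo_unaligned (float *a, const float *b, const float *c, int n)
{
  int i = 0;
#ifdef __SSE__
  for (; i + 4 <= n; i += 4)
    {
      __m128 vb = _mm_loadu_ps (b + i);            /* unaligned load  */
      __m128 vc = _mm_loadu_ps (c + i);            /* unaligned load  */
      _mm_storeu_ps (a + i, _mm_mul_ps (vb, vc));  /* unaligned store */
    }
#endif
  for (; i < n; i++)    /* epilog (or the whole loop without SSE) */
    a[i] = b[i] * c[i];
}
```

Because there is no prolog and no aligned/unaligned split, this form needs no loop versioning or peeling at all, which is why its text size is so much smaller than the peeled version 1).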
>>
>>
>> > That said, I expect 99% of used software
>> > (probably rather 99.99999%) is not compiled on the system it runs on,
>> > but compiled to run on generic hardware, and thus restricts itself to
>> > bare x86_64 SSE2 features. So what matters for enabling the
>> > vectorizer at -O2 is the default architecture features of the given
>> > architecture(!) -- and remember to not only consider x86 here!
>> >
>> >> A couple of points I'd like to make:
>> >>
>> >> 1) The loop vectorizer passes the quality threshold to be turned on
>> >>    by default at O2 in 4.9; it is already turned on for FDO at O2.
>> >
>> > With FDO we have a _much_ better way of reasoning about which loops
>> > we spend the compile time and code size on! Exactly the problem that
>> > exists without FDO at -O2 (and also at -O3, but -O3 is not claimed to
>> > be well balanced with regard to compile time and code size).
>> >
>> >> 2) There is still a lot of room for improvement in the loop
>> >>    vectorizer -- there is no doubt about it, and we will need to
>> >>    continue improving it.
>> >
>> > I believe we have to do that first. See the patches regarding the
>> > cost model reorganization that I posted with the proposal for
>> > enabling vectorization at -O2. One large source of collateral damage
>> > from vectorization is if-conversion, which aggressively if-converts
>> > loops regardless of whether we later vectorize the result. The
>> > if-conversion pass needs to be integrated with vectorization.
>>
>> We have noticed some small performance problems with tree if-conversion
>> when it is turned on with FDO -- because that pass does not have a cost
>> model (such as looking at branch probabilities, as the RTL-level if-cvt
>> does). What other problems do you see? Is it just a compile-time
>> concern?
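The if-conversion point is easiest to see on a concrete loop. The sketch below is an editor's illustration (the function names are made up): the first version carries a control dependence in the loop body, which the loop vectorizer cannot handle directly; the second is the select form that if-conversion produces, turning the control dependence into a data dependence that maps onto a vector compare-and-blend. The cost concern is that GCC if-converts loops like this even when the result is never vectorized.

```c
#include <assert.h>

/* Before if-conversion: a branch inside the loop body.  */
static void
clamp_branchy (float *a, const float *b, int n)
{
  for (int i = 0; i < n; i++)
    {
      if (b[i] > 0.0f)   /* control dependence blocks the vectorizer */
        a[i] = b[i];
      else
        a[i] = 0.0f;
    }
}

/* After if-conversion, expressed in source form: the same computation
   as a select, vectorizable as compare + blend.  */
static void
clamp_select (float *a, const float *b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = b[i] > 0.0f ? b[i] : 0.0f;
}
```

Both functions compute the same result; only the shape of the dependence differs.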
>>
>> >
>> >> 3) The only fast way to improve a feature is to get it used widely,
>> >>    so that people can file bugs and report problems -- it is hard
>> >>    for developers to find and collect all the cases where GCC is
>> >>    weak without the GCC community's help. There might be a temporary
>> >>    regression for some users, but it is worth the pain.
>> >
>> > Well, introducing known regressions at -O2 is not how this works.
>> > Vectorization is already widely tested, and you can look at a
>> > plethora of bug reports about missed features and vectorizer
>> > wrong-doings to improve it.
>> >
>> >> 4) Not the most important point, but a practical concern: without
>> >>    turning it on, GCC will be at a great disadvantage when people
>> >>    start benchmarking the latest GCC against other compilers.
>> >
>> > The same argument was made about the fact that GCC does not optimize
>> > by default but uses -O0. It's a straw-man argument. All
>> > "benchmarking" I see uses -O3 or -Ofast already.
>>
>> People can just do an -O2 performance comparison.
>>
>> thanks,
>>
>> David
>>
>> > To make vectorization have a bigger impact on day-to-day software,
>> > GCC would need to start versioning for the target sub-architecture --
>> > which of course increases the issues with code size and compile time.
>> >
>> > Richard.
>>
>> >>>>>>> Vectorization has great performance potential -- the more
>> >>>>>>> people use it, the more likely it will be further improved --
>> >>>>>>> turning it on at O2 is the way to go ...
>> >>>>>>>
>> >>>>>>> Thank you!
>> >>>>>>>
>> >>>>>>> Cong Hou