Xinliang David Li <davi...@google.com> wrote:
>Interesting idea!

In the past we had already arranged for re-use of the epilogue loop
and the scalar loop, so the situation was even worse.
Note that re-use prevents complete peeling of the epilogue, which is
often profitable. Combining with the prologue will introduce a
mispredicted branch, which can be harmful. So, certainly interesting,
but not always easily profitable.

Richard.

>David
>
>On Thu, Aug 22, 2013 at 4:46 PM, Cong Hou <co...@google.com> wrote:
>> Another opportunity to reduce the code size is combining the scalar
>> version from loop versioning, the prolog and the epilog of loop
>> peeling. I manually made the following function for foo(). The
>> running time does not change (for corei7, since I use
>> _mm_loadu_ps()), but the text size (for the function only) is
>> reduced from 342 to 240 bytes (41 for the non-vectorized version).
>> We can get more benefit if the loop body is larger.
>>
>> #include <xmmintrin.h>  /* SSE intrinsics */
>>
>> void foo2 (TYPE *a, TYPE *b, TYPE *c, int n)
>> {
>>   int i, m, next;
>>   __m128 veca, vecb, vecc;
>>
>>   i = 0;
>>
>>   /* Runtime alias check (the "loop versioning" test).  */
>>   if ((b >= a+4 | b+4 <= a) &
>>       (c >= a+4 | c+4 <= a))
>>     {
>>       m = ((unsigned long)a & 127) >> 5;
>>       goto L2;
>>
>>     L1:
>>       for (; i < n; i += 4)
>>         {
>>           vecb = _mm_loadu_ps (b+i);
>>           vecc = _mm_loadu_ps (c+i);
>>           veca = _mm_mul_ps (vecb, vecc);
>>           _mm_store_ps (a+i, veca);
>>         }
>>       m = (i == n) ? n : n+4;
>>     }
>>   else
>>     m = n;  /* aliasing possible: run the scalar loop to completion */
>>
>> L2:
>>   for (; i < m; i++)
>>     a[i] = b[i] * c[i];
>>   if (i < n)
>>     goto L1;
>> }
>>
>> thanks,
>>
>> Cong
>>
>>
>> On Wed, Aug 21, 2013 at 11:50 PM, Xinliang David Li
>> <davi...@google.com> wrote:
>>>
>>> > The effect on runtime is not correlated to either (which means
>>> > the vectorizer cost model is rather bad), but integer code
>>> > usually does not benefit at all.
>>>
>>> The cost model does need some tuning. For instance, the GCC
>>> vectorizer does peeling aggressively, but in many cases peeling can
>>> be avoided while still getting good performance -- even when the
>>> target does not have efficient unaligned loads/stores to implement
>>> unaligned accesses. GCC reports too high a cost for unaligned
>>> accesses and too low a cost for peeling overhead.
>>>
>>> Example:
>>>
>>> #ifndef TYPE
>>> #define TYPE float
>>> #endif
>>> #include <stdlib.h>
>>>
>>> __attribute__((noinline)) void
>>> foo (TYPE *a, TYPE *b, TYPE *c, int n)
>>> {
>>>   int i;
>>>   for (i = 0; i < n; i++)
>>>     a[i] = b[i] * c[i];
>>> }
>>>
>>> int g;
>>> int
>>> main ()
>>> {
>>>   int i;
>>>   float *a = (float *) malloc (100000 * 4);
>>>   float *b = (float *) malloc (100000 * 4);
>>>   float *c = (float *) malloc (100000 * 4);
>>>
>>>   for (i = 0; i < 100000; i++)
>>>     foo (a, b, c, 100000);
>>>
>>>   g = a[10];
>>> }
>>>
>>> 1) By default, GCC's vectorizer will peel the loop in foo so that
>>>    the access to 'a' is aligned and uses the movaps instruction.
>>>    The other accesses use movups when -march=corei7 is used.
>>> 2) Same as above, but with -march=x86-64: the access to 'b' is
>>>    split into 'movlps and movhps', and likewise for 'c'.
>>> 3) Disabling peeling (via a hack) with -march=corei7 -- all three
>>>    accesses use movups.
>>> 4) Disabling peeling, with -march=x86-64 -- all three accesses use
>>>    movlps/movhps.
>>>
>>> Performance:
>>>
>>> 1) and 3) -- both 1.58s, but 3) is much smaller than 1): 3)'s text
>>> is 1462 bytes, while 1)'s is 1622 bytes.
>>> 2) and 4) and the non-vectorized version -- all very slow -- 4.8s.
>>>
>>> Observations:
>>> a) If properly tuned for corei7, 3) should be picked by GCC instead
>>>    of 1) -- this is not possible today.
>>> b) With -march=x86-64, GCC should figure out that the benefit of
>>>    vectorizing the loop is small and bail out.
>>>
>>> >> On the other hand, a 10% compile-time increase due to one pass
>>> >> sounds excessive -- there might be some low-hanging fruit to
>>> >> reduce the compile-time increase.
>>> >
>>> > I have already spent two man-months speeding up the vectorizer
>>> > itself; I don't think there is any low-hanging fruit left there.
>>> > But see above -- most of the compile time is due to the cost of
>>> > processing the extra loop copies.
>>>
>>> Ok.
>>>
>>> I did not notice your patch (in May this year) until recently. Do
>>> you plan to check it in (other than the part that turns it on at
>>> O2)? The cost model part of the changes is largely independent. If
>>> it is in, it will serve as a good basis for further tuning.
>>>
>>> >> At the full feature set, vectorization regresses the runtime of
>>> >> quite a number of benchmarks significantly. At a reduced feature
>>> >> set -- basically trying to vectorize only obviously profitable
>>> >> cases -- these regressions can be avoided, but improvements
>>> >> remain on only two SPEC fp cases. As most user applications fall
>>> >> into the SPEC int category, a 10% compile-time and 15% code-size
>>> >> regression for no gain is no good.
>>> >>
>>> >> Cong's data (especially corei7 and corei7avx) shows more
>>> >> significant performance improvement. If the 10% compile-time
>>> >> increase is across the board and happens on benchmarks with no
>>> >> performance improvement, it is certainly bad -- but I am not
>>> >> sure that is the case.
>>> >
>>> > Note that we are talking about -O2 -- people that enable
>>> > -march=corei7 usually know to use -O3 or FDO anyway.
>>>
>>> Many people use FDO, but not all -- there are still some barriers
>>> to adoption. There are reasons people may not want to use O3:
>>> 1) People feel most comfortable using O2 because it is considered
>>>    the most thoroughly tested optimization level; going with the
>>>    default is the natural choice. FDO is a different beast, as the
>>>    performance benefit can be too high to resist;
>>> 2) In a distributed build environment with object file
>>>    caching/sharing, building with O3 (different from the default)
>>>    leads to longer build times;
>>> 3) The size/compile-time cost can be too high with O3.
>>> On the other hand, the benefit of the vectorizer can be very high
>>> for many types of applications, such as image processing,
>>> stitching, image detection, DSP, and encoders/decoders -- not just
>>> numerical Fortran programs.
>>>
>>> > That said, I expect 99% of used software (probably rather
>>> > 99.99999%) is not compiled on the system it runs on, but is
>>> > compiled to run on generic hardware and thus restricts itself to
>>> > bare x86_64 SSE2 features. So what matters for enabling the
>>> > vectorizer at -O2 is the default architecture features of the
>>> > given architecture(!) -- and remember to not only consider x86
>>> > here!
>>> >
>>> >> A couple of points I'd like to make:
>>> >>
>>> >> 1) The loop vectorizer passes the quality threshold to be
>>> >> turned on by default at O2 in 4.9; it is already turned on for
>>> >> FDO at O2.
>>> >
>>> > With FDO we have a _much_ better way of reasoning about which
>>> > loops we spend the compile time and code size on! That is exactly
>>> > the problem that exists without FDO at -O2 (and also at -O3, but
>>> > -O3 is not said to be well-balanced with regard to compile time
>>> > and code size).
>>> >
>>> >> 2) There is still lots of room for improvement in the loop
>>> >> vectorizer -- there is no doubt about it, and we will need to
>>> >> continue improving it;
>>> >
>>> > I believe we have to do that first. See the patches regarding the
>>> > cost model reorg that I posted with the proposal for enabling
>>> > vectorization at -O2. One large source of collateral damage from
>>> > vectorization is if-conversion, which aggressively if-converts
>>> > loops regardless of whether we later vectorize the result. The
>>> > if-conversion pass needs to be integrated with vectorization.
>>>
>>> We noticed some small performance problems with tree if-conversion
>>> when it is turned on with FDO -- because that pass has no cost
>>> model (such as looking at branch probabilities, as the RTL-level
>>> if-cvt does). What other problems do you see?
>>> Is it just a compile-time concern?
>>>
>>> >> 3) The only fast way to improve a feature is to get it used
>>> >> widely, so that people can file bugs and report problems -- it
>>> >> is hard for developers to find and collect all the cases where
>>> >> GCC is weak without the GCC community's help. There might be a
>>> >> temporary regression for some users, but it is worth the pain.
>>> >
>>> > Well, introducing known regressions at -O2 is not how this works.
>>> > Vectorization is already widely tested, and you can look at a
>>> > plethora of bug reports about missed features and vectorizer
>>> > wrong-doings to improve it.
>>> >
>>> >> 4) Not the most important one, but a practical concern: without
>>> >> turning it on, GCC will be greatly disadvantaged when people
>>> >> start benchmarking the latest GCC against other compilers.
>>> >
>>> > The same argument was made about the fact that GCC does not
>>> > optimize by default but uses -O0. It's a straw-man argument. All
>>> > "benchmarking" I see uses -O3 or -Ofast already.
>>>
>>> People can just do -O2 performance comparisons.
>>>
>>> thanks,
>>>
>>> David
>>>
>>> > To make vectorization have a bigger impact on day-to-day
>>> > software, GCC would need to start versioning for the target
>>> > sub-architecture -- which of course increases the issue with code
>>> > size and compile time.
>>> >
>>> > Richard.
>>>
>>> >>> Richard.
>>> >>>
>>> >>>> thanks,
>>> >>>>
>>> >>>> David
>>> >>>>
>>> >>>>> Richard.
>>> >>>>>
>>> >>>>>>> Vectorization has great performance potential -- the more
>>> >>>>>>> people use it, the more likely it will be further improved
>>> >>>>>>> -- turning it on at O2 is the way to go ...
>>> >>>>>>>
>>> >>>>>>> Thank you!
>>> >>>>>>>
>>> >>>>>>> Cong Hou