On Fri, Aug 23, 2013 at 5:16 AM, Richard Biener <richard.guent...@gmail.com> wrote:
> Xinliang David Li <davi...@google.com> wrote:
>> Interesting idea!
>
> In the past we have already arranged for re-use of the epilogue loop and
> the scalar loop, so the situation was even worse.
>
> Note that re-use prevents complete peeling of the epilogue, which is
> often profitable.
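As a sketch of the two epilogue shapes under discussion (hypothetical code, not GCC output; the function names are made up, and the "vector" body is written as four scalar copies for clarity):

```c
#include <assert.h>

/* Shape 1: vector loop followed by a shared scalar epilogue loop
   that handles the remaining n % 4 iterations.  */
void mul_with_epilogue_loop (float *a, const float *b, const float *c, int n)
{
  int i;
  for (i = 0; i + 4 <= n; i += 4)   /* "vector" body, 4 elements/iter */
    {
      a[i+0] = b[i+0] * c[i+0];
      a[i+1] = b[i+1] * c[i+1];
      a[i+2] = b[i+2] * c[i+2];
      a[i+3] = b[i+3] * c[i+3];
    }
  for (; i < n; i++)                /* shared scalar epilogue loop */
    a[i] = b[i] * c[i];
}

/* Shape 2: the epilogue is completely peeled into at most three
   straight-line iterations -- no back edge, no loop control.  This is
   the transformation that re-using the epilogue as the scalar
   fallback loop would prevent.  */
void mul_with_peeled_epilogue (float *a, const float *b, const float *c, int n)
{
  int i;
  for (i = 0; i + 4 <= n; i += 4)
    {
      a[i+0] = b[i+0] * c[i+0];
      a[i+1] = b[i+1] * c[i+1];
      a[i+2] = b[i+2] * c[i+2];
      a[i+3] = b[i+3] * c[i+3];
    }
  switch (n - i)                    /* 0..3 leftover elements */
    {
    case 3: a[i+2] = b[i+2] * c[i+2]; /* fall through */
    case 2: a[i+1] = b[i+1] * c[i+1]; /* fall through */
    case 1: a[i+0] = b[i+0] * c[i+0];
    }
}
```

Both shapes compute the same result; the peeled form trades a little code size for the removal of the epilogue's loop overhead and branch.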
This can be handled by a cost/benefit analysis. If complete peeling is
possible, deemed profitable at the given opt level (O2 being more
conservative, O3 more aggressive in its definition of 'profitable'), and
will actually be done by the complete unroll/peel pass, the vectorizer
can choose not to do the merging.

> Combining the prologue will introduce a mispredicted branch which can
> be harmful.

The branch pattern won't be too irregular, so it should not be hard for
the branch predictor.

David

>
> So, certainly interesting but not easily always profitable.
>
> Richard.
>
>> David
>>
>> On Thu, Aug 22, 2013 at 4:46 PM, Cong Hou <co...@google.com> wrote:
>>> Another opportunity to reduce the code size is combining the scalar
>>> version from loop versioning with the prolog and the epilog of loop
>>> peeling. I manually made the following function for foo(). The
>>> running time does not change (for corei7, since I use
>>> _mm_loadu_ps()) but the text size (for the function only) shrinks
>>> from 342 to 240 bytes (41 bytes for the non-vectorized version). We
>>> can get more benefit if the loop body is larger.
>>>
>>> void foo2 (TYPE *a, TYPE *b, TYPE *c, int n)
>>> {
>>>   int i, m;
>>>   __m128 veca, vecb, vecc;
>>>
>>>   i = 0;
>>>
>>>   if ((b >= a+4 | b+4 <= a)
>>>       & (c >= a+4 | c+4 <= a))
>>>     {
>>>       /* scalar iterations needed until a+i is 16-byte aligned
>>>          (assumes a is at least 4-byte aligned) */
>>>       m = (4 - (((unsigned long)a >> 2) & 3)) & 3;
>>>       if (m > n)
>>>         m = n;
>>>       goto L2;
>>>
>>> L1:
>>>       for (; i + 4 <= n; i += 4)
>>>         {
>>>           vecb = _mm_loadu_ps (b+i);
>>>           vecc = _mm_loadu_ps (c+i);
>>>           veca = _mm_mul_ps (vecb, vecc);
>>>           _mm_store_ps (a+i, veca);
>>>         }
>>>       m = n;  /* let the shared scalar loop finish the tail */
>>>     }
>>>   else
>>>     m = n;  /* possible aliasing: run the whole loop in scalar form */
>>>
>>> L2:
>>>   for (; i < m; i++)
>>>     a[i] = b[i] * c[i];
>>>   if (i < n)
>>>     goto L1;
>>> }
>>>
>>> thanks,
>>>
>>> Cong
>>>
>>> On Wed, Aug 21, 2013 at 11:50 PM, Xinliang David Li
>>> <davi...@google.com> wrote:
>>>>
>>>> > The effect on runtime is not correlated to either (which means
>>>> > the vectorizer cost model is rather bad), but integer code
>>>> > usually does not benefit at all.
>>>>
>>>> The cost model does need some tuning. For instance, the GCC
>>>> vectorizer does peeling aggressively, but in many cases peeling can
>>>> be avoided while still getting good performance -- even when the
>>>> target does not have efficient unaligned loads/stores to implement
>>>> unaligned accesses. GCC reports too high a cost for unaligned
>>>> accesses and too low a cost for the peeling overhead.
>>>>
>>>> Example:
>>>>
>>>> #ifndef TYPE
>>>> #define TYPE float
>>>> #endif
>>>> #include <stdlib.h>
>>>>
>>>> __attribute__((noinline)) void
>>>> foo (TYPE *a, TYPE *b, TYPE *c, int n)
>>>> {
>>>>   int i;
>>>>   for (i = 0; i < n; i++)
>>>>     a[i] = b[i] * c[i];
>>>> }
>>>>
>>>> int g;
>>>> int
>>>> main ()
>>>> {
>>>>   int i;
>>>>   float *a = (float *) malloc (100000 * 4);
>>>>   float *b = (float *) malloc (100000 * 4);
>>>>   float *c = (float *) malloc (100000 * 4);
>>>>
>>>>   for (i = 0; i < 100000; i++)
>>>>     foo (a, b, c, 100000);
>>>>
>>>>   g = a[10];
>>>> }
>>>>
>>>> 1) By default, GCC's vectorizer will peel the loop in foo so that
>>>>    the access to 'a' is aligned and uses the movaps instruction.
>>>>    The other accesses use movups when -march=corei7 is used.
>>>> 2) Same as above, but with -march=x86-64. The access to 'b' is
>>>>    split into 'movlps and movhps'; same for 'c'.
>>>> 3) Disabling peeling (via a hack) with -march=corei7 -- all three
>>>>    accesses use movups.
>>>> 4) Disabling peeling, with -march=x86-64 -- all three accesses use
>>>>    movlps/movhps.
>>>>
>>>> Performance:
>>>>
>>>> 1) and 3) -- both 1.58s, but 3) is much smaller than 1).
>>>> 3)'s text is 1462 bytes, and 1)'s is 1622 bytes.
>>>> 2) and 4) and no vectorization -- all very slow -- 4.8s.
>>>>
>>>> Observations:
>>>> a) If properly tuned for corei7, 3) should be picked by GCC instead
>>>>    of 1) -- this is not possible today.
>>>> b) With -march=x86-64, GCC should figure out that the benefit of
>>>>    vectorizing this loop is small and bail out.
>>>>
>>>> >> On the other hand, a 10% compile time increase due to one pass
>>>> >> sounds excessive -- there might be some low-hanging fruit to
>>>> >> reduce the compile-time increase.
>>>> >
>>>> > I have already spent two man-months speeding up the vectorizer
>>>> > itself; I don't think there is any low-hanging fruit left there.
>>>> > But see above -- most of the compile time is due to the cost of
>>>> > processing the extra loop copies.
>>>> >
>>>>
>>>> Ok.
>>>>
>>>> I did not notice your patch (from May this year) until recently. Do
>>>> you plan to check it in (other than the part that turns it on at
>>>> O2)? The cost model part of the changes is largely independent. If
>>>> it is in, it will serve as a good basis for further tuning.
>>>>
>>>> >> At the full feature set, vectorization significantly regresses
>>>> >> the runtime of quite a number of benchmarks. At a reduced
>>>> >> feature set -- basically trying to vectorize only obviously
>>>> >> profitable cases -- these regressions can be avoided, but
>>>> >> progressions remain on only two SPEC fp cases. As most user
>>>> >> applications fall into the SPEC int category, a 10% compile-time
>>>> >> and 15% code-size regression for no gain is no good.
>>>> >>
>>>> >> Cong's data (especially corei7 and corei7avx) shows more
>>>> >> significant performance improvements. If the 10% compile time
>>>> >> increase is across the board and happens on benchmarks with no
>>>> >> performance improvement, it is certainly bad -- but I am not
>>>> >> sure that is the case.
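Configuration 3) above -- no alignment peeling, all three accesses unaligned (movups on corei7) -- corresponds roughly to this hand-written sketch (a hypothetical illustration, not compiler output; the function name is made up):

```c
#include <xmmintrin.h>

/* No-peel variant: start vectorizing at i = 0 regardless of the
   alignment of a, b, or c, using unaligned loads AND an unaligned
   store.  A short scalar tail covers the last n % 4 elements, so
   there is no alignment prologue at all.  */
void foo_nopeel (float *a, const float *b, const float *c, int n)
{
  int i;
  for (i = 0; i + 4 <= n; i += 4)
    {
      __m128 vb = _mm_loadu_ps (b + i);            /* movups */
      __m128 vc = _mm_loadu_ps (c + i);            /* movups */
      _mm_storeu_ps (a + i, _mm_mul_ps (vb, vc));  /* movups */
    }
  for (; i < n; i++)        /* scalar tail, no prologue needed */
    a[i] = b[i] * c[i];
}
```

On hardware with fast unaligned accesses (corei7 and later), this form matches the peeled version's runtime in the measurements above while avoiding the prologue's extra code.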
>>>> >
>>>> > Note that we are talking about -O2 -- people that enable
>>>> > -march=corei7 usually know to use -O3 or FDO anyway.
>>>>
>>>> Many people use FDO, but not all -- there are still some barriers
>>>> to adoption. And there are reasons people may not want to use O3:
>>>> 1) People feel most comfortable using O2 because it is considered
>>>>    the most thoroughly tested optimization level; going with the
>>>>    default is the natural choice. FDO is a different beast, as its
>>>>    performance benefit can be too great to resist.
>>>> 2) In a distributed build environment with object-file
>>>>    caching/sharing, building with O3 (different from the default)
>>>>    leads to longer build times.
>>>> 3) The size/compile-time cost of O3 can be too high. On the other
>>>>    hand, the benefit of the vectorizer can be very high for many
>>>>    kinds of applications -- image processing, stitching, image
>>>>    detection, DSP, encoders/decoders -- not just numerical Fortran
>>>>    programs.
>>>>
>>>> > That said, I expect 99% of used software (probably rather
>>>> > 99.99999%) is not compiled on the system it runs on but compiled
>>>> > to run on generic hardware, and thus restricts itself to bare
>>>> > x86_64 SSE2 features. So what matters for enabling the vectorizer
>>>> > at -O2 is the default architecture features of the given
>>>> > architecture(!) -- and remember to not only consider x86 here!
>>>> >
>>>> >> A couple of points I'd like to make:
>>>> >>
>>>> >> 1) The loop vectorizer passes the quality threshold to be turned
>>>> >>    on by default at O2 in 4.9; it is already turned on for FDO
>>>> >>    at O2.
>>>> >
>>>> > With FDO we have a _much_ better way of reasoning about which
>>>> > loops we spend the compile-time and code-size on!
>>>> > Exactly the problem
>>>> > that exists without FDO at -O2 (and also at -O3, but -O3 is not
>>>> > said to be well-balanced with regard to compile-time and
>>>> > code-size).
>>>> >
>>>> >> 2) There is still lots of room for improvement in the loop
>>>> >>    vectorizer -- there is no doubt about it, and we will need to
>>>> >>    continue improving it.
>>>> >
>>>> > I believe we have to do that first. See the patches regarding the
>>>> > cost model reorg that I posted with the proposal for enabling
>>>> > vectorization at -O2.
>>>> > One large source of collateral damage from vectorization is
>>>> > if-conversion, which aggressively if-converts loops regardless of
>>>> > whether we later vectorize the result.
>>>> > The if-conversion pass needs to be integrated with vectorization.
>>>>
>>>> We noticed some small performance problems with tree if-conversion
>>>> when it is turned on with FDO -- because that pass does not have a
>>>> cost model (such as looking at branch probabilities, as the
>>>> RTL-level if-cvt does). What other problems do you see? Is it just
>>>> a compile-time concern?
>>>>
>>>> >
>>>> >> 3) The only fast way to improve a feature is to get it widely
>>>> >>    used so that people can file bugs and report problems -- it
>>>> >>    is hard for developers to find and collect all the cases
>>>> >>    where GCC is weak without the GCC community's help. There
>>>> >>    might be a temporary regression for some users, but it is
>>>> >>    worth the pain.
>>>> >
>>>> > Well, introducing known regressions at -O2 is not how this works.
>>>> > Vectorization is already widely tested, and you can look at a
>>>> > plethora of bug reports about missed features and vectorizer
>>>> > wrong-doings to improve it.
>>>> >
>>>> >> 4) Not the most important one, but a practical concern: without
>>>> >>    turning it on, GCC will be at a great disadvantage when
>>>> >>    people start benchmarking the latest GCC against other
>>>> >>    compilers ...
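The if-conversion discussed above can be sketched at the source level (a hypothetical rendering of the tree-level transformation; the function names are made up): the branch in the loop body is replaced by an unconditional select, which makes the loop vectorizable.

```c
/* Before if-conversion: control flow inside the loop body blocks
   vectorization.  */
void clamp_branchy (float *a, const float *b, int n)
{
  for (int i = 0; i < n; i++)
    {
      if (b[i] > 0.0f)
        a[i] = b[i];
      else
        a[i] = 0.0f;
    }
}

/* After if-conversion: the branch becomes a select, so each
   iteration is straight-line code and the loop can be vectorized
   (e.g. with a compare + blend).  */
void clamp_ifconverted (float *a, const float *b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = (b[i] > 0.0f) ? b[i] : 0.0f;
}
```

The cost-model concern raised in the thread is visible here: if the loop is ultimately not vectorized, the select form executes both "arms" worth of work every iteration and can lose to the branchy form when the branch is well-predicted.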
>>>> >
>>>> > The same argument was made about the fact that GCC does not
>>>> > optimize by default but uses -O0. It's a straw-man argument. All
>>>> > "benchmarking" I see uses -O3 or -Ofast already.
>>>>
>>>> People can just do -O2 performance comparisons.
>>>>
>>>> thanks,
>>>>
>>>> David
>>>>
>>>> > To make vectorization have a bigger impact on day-to-day
>>>> > software, GCC would need to start versioning for the target
>>>> > sub-architecture -- which of course increases the issues with
>>>> > code size and compile time.
>>>> >
>>>> > Richard.
>>>> >
>>>> >> thanks,
>>>> >>
>>>> >> David
>>>> >>
>>>> >>> Richard.
>>>> >>>
>>>> >>>> thanks,
>>>> >>>>
>>>> >>>> David
>>>> >>>>
>>>> >>>>> Richard.
>>>> >>>>>
>>>> >>>>>>> Vectorization has great performance potential -- the more
>>>> >>>>>>> people use it, the more likely it will be further improved
>>>> >>>>>>> -- turning it on at O2 is the way to go ...
>>>> >>>>>>>
>>>> >>>>>>> Thank you!
>>>> >>>>>>>
>>>> >>>>>>> Cong Hou