On Fri, Aug 23, 2013 at 5:16 AM, Richard Biener <richard.guent...@gmail.com> wrote:
> Xinliang David Li <davi...@google.com> wrote:
>> Interesting idea!
>
> In the past we have already arranged for re-use of the epilogue loop and
> the scalar loop, so the situation was even worse.
>
> Note that re-use prevents complete peeling of the epilogue, which is
> often profitable.
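As a sketch of the two epilogue shapes under discussion (hypothetical code, not GCC output; the function names are made up, and the "vector" body is written as four scalar copies for clarity):

```c
#include <assert.h>

/* Shape 1: vector loop followed by a shared scalar epilogue loop
   that handles the remaining n % 4 iterations.  */
void mul_with_epilogue_loop (float *a, const float *b, const float *c, int n)
{
  int i;
  for (i = 0; i + 4 <= n; i += 4)   /* "vector" body, 4 elements/iter */
    {
      a[i+0] = b[i+0] * c[i+0];
      a[i+1] = b[i+1] * c[i+1];
      a[i+2] = b[i+2] * c[i+2];
      a[i+3] = b[i+3] * c[i+3];
    }
  for (; i < n; i++)                /* shared scalar epilogue loop */
    a[i] = b[i] * c[i];
}

/* Shape 2: the epilogue is completely peeled into at most three
   straight-line iterations -- no back edge, no loop control.  This is
   the transformation that re-using the epilogue as the scalar
   fallback loop would prevent.  */
void mul_with_peeled_epilogue (float *a, const float *b, const float *c, int n)
{
  int i;
  for (i = 0; i + 4 <= n; i += 4)
    {
      a[i+0] = b[i+0] * c[i+0];
      a[i+1] = b[i+1] * c[i+1];
      a[i+2] = b[i+2] * c[i+2];
      a[i+3] = b[i+3] * c[i+3];
    }
  switch (n - i)                    /* 0..3 leftover elements */
    {
    case 3: a[i+2] = b[i+2] * c[i+2]; /* fall through */
    case 2: a[i+1] = b[i+1] * c[i+1]; /* fall through */
    case 1: a[i+0] = b[i+0] * c[i+0];
    }
}
```

Both shapes compute the same result; the peeled form trades a little code size for the removal of the epilogue's loop overhead and branch.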
This can be handled by a cost/benefit analysis. If complete peeling is
possible, deemed profitable at the given opt level (O2 being more
conservative, O3 more aggressive in its definition of 'profitable'), and
will actually be done by the complete unroll/peel pass, the vectorizer
can choose not to do the merging.

> Combining the prologue will introduce a mispredicted branch which can
> be harmful.

The branch pattern won't be too irregular, so it should not be hard for
the branch predictor.

David

>
> So, certainly interesting but not easily always profitable.
>
> Richard.
>
>> David
>>
>> On Thu, Aug 22, 2013 at 4:46 PM, Cong Hou <co...@google.com> wrote:
>>> Another opportunity to reduce the code size is combining the scalar
>>> version from loop versioning with the prolog and the epilog of loop
>>> peeling. I manually made the following function for foo(). The
>>> running time does not change (for corei7, since I use
>>> _mm_loadu_ps()) but the text size (for the function only) shrinks
>>> from 342 to 240 bytes (41 bytes for the non-vectorized version). We
>>> can get more benefit if the loop body is larger.
>>>
>>> void foo2 (TYPE *a, TYPE *b, TYPE *c, int n)
>>> {
>>>   int i, m;
>>>   __m128 veca, vecb, vecc;
>>>
>>>   i = 0;
>>>
>>>   if ((b >= a+4 | b+4 <= a)
>>>       & (c >= a+4 | c+4 <= a))
>>>     {
>>>       /* scalar iterations needed until a+i is 16-byte aligned
>>>          (assumes a is at least 4-byte aligned) */
>>>       m = (4 - (((unsigned long)a >> 2) & 3)) & 3;
>>>       if (m > n)
>>>         m = n;
>>>       goto L2;
>>>
>>> L1:
>>>       for (; i + 4 <= n; i += 4)
>>>         {
>>>           vecb = _mm_loadu_ps (b+i);
>>>           vecc = _mm_loadu_ps (c+i);
>>>           veca = _mm_mul_ps (vecb, vecc);
>>>           _mm_store_ps (a+i, veca);
>>>         }
>>>       m = n;  /* let the shared scalar loop finish the tail */
>>>     }
>>>   else
>>>     m = n;  /* possible aliasing: run the whole loop in scalar form */
>>>
>>> L2:
>>>   for (; i < m; i++)
>>>     a[i] = b[i] * c[i];
>>>   if (i < n)
>>>     goto L1;
>>> }
>>>
>>> thanks,
>>>
>>> Cong
>>>
>>> On Wed, Aug 21, 2013 at 11:50 PM, Xinliang David Li
>>> <davi...@google.com> wrote:
>>>>
>>>> > The effect on runtime is not correlated to either (which means
>>>> > the vectorizer cost model is rather bad), but integer code
>>>> > usually does not benefit at all.
>>>>
>>>> The cost model does need some tuning. For instance, the GCC
>>>> vectorizer does peeling aggressively, but in many cases peeling can
>>>> be avoided while still getting good performance -- even when the
>>>> target does not have efficient unaligned loads/stores to implement
>>>> unaligned accesses. GCC reports too high a cost for unaligned
>>>> accesses and too low a cost for the peeling overhead.
>>>>
>>>> Example:
>>>>
>>>> #ifndef TYPE
>>>> #define TYPE float
>>>> #endif
>>>> #include <stdlib.h>
>>>>
>>>> __attribute__((noinline)) void
>>>> foo (TYPE *a, TYPE *b, TYPE *c, int n)
>>>> {
>>>>   int i;
>>>>   for (i = 0; i < n; i++)
>>>>     a[i] = b[i] * c[i];
>>>> }
>>>>
>>>> int g;
>>>> int
>>>> main ()
>>>> {
>>>>   int i;
>>>>   float *a = (float *) malloc (100000 * 4);
>>>>   float *b = (float *) malloc (100000 * 4);
>>>>   float *c = (float *) malloc (100000 * 4);
>>>>
>>>>   for (i = 0; i < 100000; i++)
>>>>     foo (a, b, c, 100000);
>>>>
>>>>   g = a[10];
>>>> }
>>>>
>>>> 1) By default, GCC's vectorizer will peel the loop in foo so that
>>>>    the access to 'a' is aligned and uses the movaps instruction.
>>>>    The other accesses use movups when -march=corei7 is used.
>>>> 2) Same as above, but with -march=x86-64. The access to 'b' is
>>>>    split into 'movlps and movhps'; same for 'c'.
>>>> 3) Disabling peeling (via a hack) with -march=corei7 -- all three
>>>>    accesses use movups.
>>>> 4) Disabling peeling, with -march=x86-64 -- all three accesses use
>>>>    movlps/movhps.
>>>>
>>>> Performance:
>>>>
>>>> 1) and 3) -- both 1.58s, but 3) is much smaller than 1).
>>>> 3)'s text is 1462 bytes, and 1)'s is 1622 bytes.
>>>> 2) and 4) and no vectorization -- all very slow -- 4.8s.
>>>>
>>>> Observations:
>>>> a) If properly tuned for corei7, 3) should be picked by GCC instead
>>>>    of 1) -- this is not possible today.
>>>> b) With -march=x86-64, GCC should figure out that the benefit of
>>>>    vectorizing this loop is small and bail out.
>>>>
>>>> >> On the other hand, a 10% compile time increase due to one pass
>>>> >> sounds excessive -- there might be some low-hanging fruit to
>>>> >> reduce the compile-time increase.
>>>> >
>>>> > I have already spent two man-months speeding up the vectorizer
>>>> > itself; I don't think there is any low-hanging fruit left there.
>>>> > But see above -- most of the compile time is due to the cost of
>>>> > processing the extra loop copies.
>>>> >
>>>>
>>>> Ok.
>>>>
>>>> I did not notice your patch (from May this year) until recently. Do
>>>> you plan to check it in (other than the part that turns it on at
>>>> O2)? The cost model part of the changes is largely independent. If
>>>> it is in, it will serve as a good basis for further tuning.
>>>>
>>>> >> At the full feature set, vectorization significantly regresses
>>>> >> the runtime of quite a number of benchmarks. At a reduced
>>>> >> feature set -- basically trying to vectorize only obviously
>>>> >> profitable cases -- these regressions can be avoided, but
>>>> >> progressions remain on only two SPEC fp cases. As most user
>>>> >> applications fall into the SPEC int category, a 10% compile-time
>>>> >> and 15% code-size regression for no gain is no good.
>>>> >>
>>>> >> Cong's data (especially corei7 and corei7avx) shows more
>>>> >> significant performance improvements. If the 10% compile time
>>>> >> increase is across the board and happens on benchmarks with no
>>>> >> performance improvement, it is certainly bad -- but I am not
>>>> >> sure that is the case.
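Configuration 3) above -- no alignment peeling, all three accesses unaligned (movups on corei7) -- corresponds roughly to this hand-written sketch (a hypothetical illustration, not compiler output; the function name is made up):

```c
#include <xmmintrin.h>

/* No-peel variant: start vectorizing at i = 0 regardless of the
   alignment of a, b, or c, using unaligned loads AND an unaligned
   store.  A short scalar tail covers the last n % 4 elements, so
   there is no alignment prologue at all.  */
void foo_nopeel (float *a, const float *b, const float *c, int n)
{
  int i;
  for (i = 0; i + 4 <= n; i += 4)
    {
      __m128 vb = _mm_loadu_ps (b + i);            /* movups */
      __m128 vc = _mm_loadu_ps (c + i);            /* movups */
      _mm_storeu_ps (a + i, _mm_mul_ps (vb, vc));  /* movups */
    }
  for (; i < n; i++)        /* scalar tail, no prologue needed */
    a[i] = b[i] * c[i];
}
```

On hardware with fast unaligned accesses (corei7 and later), this form matches the peeled version's runtime in the measurements above while avoiding the prologue's extra code.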
>>>> >
>>>> > Note that we are talking about -O2 -- people that enable
>>>> > -march=corei7 usually know to use -O3 or FDO anyway.
>>>>
>>>> Many people use FDO, but not all -- there are still some barriers
>>>> to adoption. And there are reasons people may not want to use O3:
>>>> 1) People feel most comfortable using O2 because it is considered
>>>>    the most thoroughly tested optimization level; going with the
>>>>    default is the natural choice. FDO is a different beast, as its
>>>>    performance benefit can be too great to resist.
>>>> 2) In a distributed build environment with object-file
>>>>    caching/sharing, building with O3 (different from the default)
>>>>    leads to longer build times.
>>>> 3) The size/compile-time cost of O3 can be too high. On the other
>>>>    hand, the benefit of the vectorizer can be very high for many
>>>>    kinds of applications -- image processing, stitching, image
>>>>    detection, DSP, encoders/decoders -- not just numerical Fortran
>>>>    programs.
>>>>
>>>> > That said, I expect 99% of used software (probably rather
>>>> > 99.99999%) is not compiled on the system it runs on but compiled
>>>> > to run on generic hardware, and thus restricts itself to bare
>>>> > x86_64 SSE2 features. So what matters for enabling the vectorizer
>>>> > at -O2 is the default architecture features of the given
>>>> > architecture(!) -- and remember to not only consider x86 here!
>>>> >
>>>> >> A couple of points I'd like to make:
>>>> >>
>>>> >> 1) The loop vectorizer passes the quality threshold to be turned
>>>> >>    on by default at O2 in 4.9; it is already turned on for FDO
>>>> >>    at O2.
>>>> >
>>>> > With FDO we have a _much_ better way of reasoning about which
>>>> > loops we spend the compile-time and code-size on!
>>>> > Exactly the problem
>>>> > that exists without FDO at -O2 (and also at -O3, but -O3 is not
>>>> > said to be well-balanced with regard to compile-time and
>>>> > code-size).
>>>> >
>>>> >> 2) There is still lots of room for improvement in the loop
>>>> >>    vectorizer -- there is no doubt about it, and we will need to
>>>> >>    continue improving it.
>>>> >
>>>> > I believe we have to do that first. See the patches regarding the
>>>> > cost model reorg that I posted with the proposal for enabling
>>>> > vectorization at -O2.
>>>> > One large source of collateral damage from vectorization is
>>>> > if-conversion, which aggressively if-converts loops regardless of
>>>> > whether we later vectorize the result.
>>>> > The if-conversion pass needs to be integrated with vectorization.
>>>>
>>>> We noticed some small performance problems with tree if-conversion
>>>> when it is turned on with FDO -- because that pass does not have a
>>>> cost model (such as looking at branch probabilities, as the
>>>> RTL-level if-cvt does). What other problems do you see? Is it just
>>>> a compile-time concern?
>>>>
>>>> >
>>>> >> 3) The only fast way to improve a feature is to get it widely
>>>> >>    used so that people can file bugs and report problems -- it
>>>> >>    is hard for developers to find and collect all the cases
>>>> >>    where GCC is weak without the GCC community's help. There
>>>> >>    might be a temporary regression for some users, but it is
>>>> >>    worth the pain.
>>>> >
>>>> > Well, introducing known regressions at -O2 is not how this works.
>>>> > Vectorization is already widely tested, and you can look at a
>>>> > plethora of bug reports about missed features and vectorizer
>>>> > wrong-doings to improve it.
>>>> >
>>>> >> 4) Not the most important one, but a practical concern: without
>>>> >>    turning it on, GCC will be at a great disadvantage when
>>>> >>    people start benchmarking the latest GCC against other
>>>> >>    compilers ...
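The if-conversion discussed above can be sketched at the source level (a hypothetical rendering of the tree-level transformation; the function names are made up): the branch in the loop body is replaced by an unconditional select, which makes the loop vectorizable.

```c
/* Before if-conversion: control flow inside the loop body blocks
   vectorization.  */
void clamp_branchy (float *a, const float *b, int n)
{
  for (int i = 0; i < n; i++)
    {
      if (b[i] > 0.0f)
        a[i] = b[i];
      else
        a[i] = 0.0f;
    }
}

/* After if-conversion: the branch becomes a select, so each
   iteration is straight-line code and the loop can be vectorized
   (e.g. with a compare + blend).  */
void clamp_ifconverted (float *a, const float *b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = (b[i] > 0.0f) ? b[i] : 0.0f;
}
```

The cost-model concern raised in the thread is visible here: if the loop is ultimately not vectorized, the select form executes both "arms" worth of work every iteration and can lose to the branchy form when the branch is well-predicted.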
>>>> >
>>>> > The same argument was made about the fact that GCC does not
>>>> > optimize by default but uses -O0. It's a straw-man argument. All
>>>> > "benchmarking" I see uses -O3 or -Ofast already.
>>>>
>>>> People can just do -O2 performance comparisons.
>>>>
>>>> thanks,
>>>>
>>>> David
>>>>
>>>> > To make vectorization have a bigger impact on day-to-day
>>>> > software, GCC would need to start versioning for the target
>>>> > sub-architecture -- which of course increases the issues with
>>>> > code size and compile time.
>>>> >
>>>> > Richard.
>>>> >
>>>> >> thanks,
>>>> >>
>>>> >> David
>>>> >>
>>>> >>> Richard.
>>>> >>>
>>>> >>>> thanks,
>>>> >>>>
>>>> >>>> David
>>>> >>>>
>>>> >>>>> Richard.
>>>> >>>>>
>>>> >>>>>>> Vectorization has great performance potential -- the more
>>>> >>>>>>> people use it, the more likely it will be further improved
>>>> >>>>>>> -- turning it on at O2 is the way to go ...
>>>> >>>>>>>
>>>> >>>>>>> Thank you!
>>>> >>>>>>>
>>>> >>>>>>> Cong Hou