Xinliang David Li <davi...@google.com> wrote:
>Interesting idea!

In the past we already arranged for re-use of the epilogue loop and the
scalar loop, so the situation was even worse.

Note that re-use prevents complete peeling of the epilogue, which is often 
profitable.  Combining in the prologue introduces a mispredicted branch, which 
can be harmful.

So, certainly interesting, but not always easily profitable.

Richard.

>David
>
>On Thu, Aug 22, 2013 at 4:46 PM, Cong Hou <co...@google.com> wrote:
>> Another opportunity to reduce the code size is combining the scalar
>> version from loop versioning with the prolog and the epilog of loop
>> peeling. I manually made the following function for foo(). The
>> running time does not change (for corei7, since I use _mm_loadu_ps())
>> but the text size (for the function only) drops from 342 to 240
>> bytes (41 for the non-vectorized version). We can get more benefit
>> if the loop body is larger.
>>
>>
>> #include <xmmintrin.h>
>>
>> void foo2 (TYPE *a, TYPE *b, TYPE *c, int n)
>> {
>>   int i, m;
>>   __m128 veca, vecb, vecc;
>>
>>   i = 0;
>>
>>   /* Versioning check: dependence distance of at least one vector.  */
>>   if ((b >= a+4 | b+4 <= a) &
>>       (c >= a+4 | c+4 <= a))
>>   {
>>     /* Peel until a+i is 16-byte aligned (a assumed 4-byte aligned).  */
>>     m = (4 - (((unsigned long)a >> 2) & 3)) & 3;
>>     if (m > n) m = n;
>>     goto L2;
>>
>> L1:
>>     for (; i + 4 <= n; i += 4)
>>     {
>>       vecb = _mm_loadu_ps(b+i);
>>       vecc = _mm_loadu_ps(c+i);
>>       veca = _mm_mul_ps(vecb, vecc);
>>       _mm_store_ps(a+i, veca);
>>     }
>>     m = n;
>>   }
>>   else
>>     m = n;
>>
>> L2:
>>   /* Shared scalar loop: versioning fallback, prolog and epilog.  */
>>   for (; i < m; i++)
>>     a[i] = b[i] * c[i];
>>   if (i < n)
>>     goto L1;
>> }
>>
>>
>>
>> thanks,
>>
>> Cong
>>
>>
>> On Wed, Aug 21, 2013 at 11:50 PM, Xinliang David Li <davi...@google.com>
>> wrote:
>>>
>>> > The effect on runtime is not correlated to
>>> > either (which means the vectorizer cost model is rather bad), but
>>> > integer
>>> > code usually does not benefit at all.
>>>
>>> The cost model does need some tuning. For instance, the GCC
>>> vectorizer does peeling aggressively, but peeling can often be
>>> avoided while still getting good performance -- even when the
>>> target does not have efficient unaligned load/store instructions.
>>> GCC reports too high a cost for unaligned accesses and too low a
>>> cost for peeling overhead.
>>>
>>> Example:
>>>
>>> #ifndef TYPE
>>> #define TYPE float
>>> #endif
>>> #include <stdlib.h>
>>>
>>> __attribute__((noinline)) void
>>> foo (TYPE *a, TYPE* b, TYPE *c, int n)
>>> {
>>>    int i;
>>>    for ( i = 0; i < n; i++)
>>>      a[i] = b[i] * c[i];
>>> }
>>>
>>> int g;
>>> int
>>> main()
>>> {
>>>    int i;
>>>    float *a = (float*) malloc (100000*4);
>>>    float *b = (float*) malloc (100000*4);
>>>    float *c = (float*) malloc (100000*4);
>>>
>>>    for (i = 0; i < 100000; i++)
>>>       foo(a, b, c, 100000);
>>>
>>>
>>>    g = a[10];
>>>
>>> }
>>>
>>>
>>> 1) By default, GCC's vectorizer peels the loop in foo so that the
>>> access to 'a' is aligned and uses the movaps instruction; the other
>>> accesses use movups when -march=corei7 is given.
>>> 2) Same as above, but with -march=x86-64: the access to 'b' is
>>> split into movlps and movhps, and likewise for 'c'.
>>> 3) Peeling disabled (via a hack) with -march=corei7: all three
>>> accesses use movups.
>>> 4) Peeling disabled, with -march=x86-64: all three accesses use
>>> movlps/movhps.
>>>
>>> Performance:
>>>
>>> 1) and 3) -- both 1.58s, but 3) is much smaller than 1): 3)'s text
>>> is 1462 bytes, 1)'s is 1622 bytes.
>>> 2) and 4) and the non-vectorized version -- all very slow -- 4.8s.
>>>
>>> Observations:
>>> a) If properly tuned for corei7, GCC should pick 3) instead of
>>> 1) -- this is not possible today.
>>> b) With -march=x86-64, GCC should figure out that the benefit of
>>> vectorizing this loop is small and bail out.
>>>
>>> >> On the other hand, a 10% compile time increase due to one pass
>>> >> sounds excessive -- there might be some low hanging fruit to
>>> >> reduce the compile time increase.
>>> >
>>> > I have already spent two man-months speeding up the vectorizer
>>> > itself, so I don't think there is any low-hanging fruit left
>>> > there.  But see above - most of the compile-time is due to the
>>> > cost of processing the extra loop copies.
>>> >
>>>
>>> Ok.
>>>
>>> I did not notice your patch (from May this year) until recently.
>>> Do you plan to check it in (apart from the part that turns it on at
>>> O2)?  The cost model part of the changes is largely independent.
>>> If it is in, it will serve as a good basis for further tuning.
>>>
>>>
>>> >> At the full feature set, vectorization regresses the runtime of
>>> >> quite a number of benchmarks significantly. At a reduced feature
>>> >> set - basically trying to vectorize only obviously profitable
>>> >> cases - these regressions can be avoided, but progressions
>>> >> remain on only two SPEC fp cases. As most user applications fall
>>> >> into the SPEC int category, a 10% compile-time and 15% code-size
>>> >> regression for no gain is no good.
>>> >>
>>> >> Cong's data (especially corei7 and corei7avx) shows more
>>> >> significant performance improvement.  If the 10% compile time
>>> >> increase is across the board and happens on benchmarks with no
>>> >> performance improvement, it is certainly bad - but I am not sure
>>> >> that is the case.
>>> >
>>> > Note that we are talking about -O2 - people who enable
>>> > -march=corei7 usually know to use -O3 or FDO anyway.
>>>
>>> Many people use FDO, but not all -- there are still some barriers
>>> to adoption. There are also reasons people may not want to use O3:
>>> 1) People feel most comfortable with O2 because it is considered
>>> the most thoroughly tested optimization level, and going with the
>>> default is the natural choice. FDO is a different beast, as its
>>> performance benefit can be too high to resist.
>>> 2) In a distributed build environment with object file
>>> caching/sharing, building with O3 (different from the default)
>>> leads to longer build times.
>>> 3) The size/compile time cost can be too high with O3. On the
>>> other hand, the benefit of the vectorizer can be very high for many
>>> types of applications such as image processing, stitching, image
>>> detection, dsp, and encoders/decoders -- not just numerical Fortran
>>> programs.
>>>
>>>
>>> > That said, I expect 99% of used software (probably rather
>>> > 99.99999%) is not compiled on the system it runs on but compiled
>>> > to run on generic hardware, and thus restricts itself to bare
>>> > x86_64 SSE2 features.  So what matters for enabling the
>>> > vectorizer at -O2 is the default architecture features of the
>>> > given architecture(!) - and remember to not only consider x86
>>> > here!
>>> >
>>> >> A couple of points I'd like to make:
>>> >>
>>> >> 1) The loop vectorizer passes the quality threshold to be
>>> >> turned on by default at O2 in 4.9; it is already turned on for
>>> >> FDO at O2.
>>> >
>>> > With FDO we have a _much_ better way of reasoning about which
>>> > loops we spend the compile-time and code-size on!  Exactly the
>>> > problem that exists without FDO at -O2 (and also at -O3, but -O3
>>> > is not said to be well-balanced with regard to compile-time and
>>> > code-size).
>>> >
>>> >> 2) There is still a lot of room for improvement in the loop
>>> >> vectorizer -- there is no doubt about it, and we will need to
>>> >> continue improving it.
>>> >
>>> > I believe we have to do that first.  See the patches regarding
>>> > the cost model reorg I posted with the proposal for enabling
>>> > vectorization at -O2.  One large source of collateral damage from
>>> > vectorization is if-conversion, which aggressively if-converts
>>> > loops regardless of whether we later vectorize the result.
>>> > The if-conversion pass needs to be integrated with vectorization.
>>>
>>> We noticed some small performance problems with tree if-conversion
>>> when it is turned on with FDO -- because that pass has no cost
>>> model (such as looking at branch probabilities, as the RTL-level
>>> if-cvt does). What other problems do you see? Is it just a compile
>>> time concern?
>>>
>>> >
>>> >> 3) The only fast way to improve a feature is to get it widely
>>> >> used, so that people can file bugs and report problems -- it is
>>> >> hard for developers to find and collect all the cases where GCC
>>> >> is weak without the GCC community's help. There might be
>>> >> temporary regressions for some users, but it is worth the pain.
>>> >
>>> > Well, introducing known regressions at -O2 is not how this
>>> > works.  Vectorization is already widely tested, and you can look
>>> > at a plethora of bug reports about missed features and vectorizer
>>> > wrong-doings to improve it.
>>> >
>>> >> 4) Not the most important one, but a practical concern: without
>>> >> turning it on, GCC will be at a great disadvantage when people
>>> >> start benchmarking the latest GCC against other compilers.
>>> >
>>> > The same argument was made about the fact that GCC does not
>>> > optimize by default but uses -O0.  It's a straw-man argument.
>>> > All "benchmarking" I see uses -O3 or -Ofast already.
>>>
>>> People can just do -O2 performance comparison.
>>>
>>> thanks,
>>>
>>> David
>>>
>>> > To make vectorization have a bigger impact on day-to-day
>>> > software, GCC would need to start versioning for the target
>>> > sub-architecture - which of course increases the issues with
>>> > code-size and compile-time.
>>> >
>>> > Richard.
>>> >
>>> >> thanks,
>>> >>
>>> >> David
>>> >>
>>> >>
>>> >>
>>> >>> Richard.
>>> >>>
>>> >>>>thanks,
>>> >>>>
>>> >>>>David
>>> >>>>
>>> >>>>
>>> >>>>>
>>> >>>>> Richard.
>>> >>>>>
>>> >>>>>>>
>>> >>>>>>> Vectorization has great performance potential -- the more
>>> >>>>>>> people use it, the more likely it is to be further improved
>>> >>>>>>> -- turning it on at O2 is the way to go ...
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> Thank you!
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> Cong Hou
>>> >>>>>
>>> >>>>>
>>> >>>
>>> >>>
>>
>>

