Interesting idea!

David
On Thu, Aug 22, 2013 at 4:46 PM, Cong Hou <co...@google.com> wrote:
> Another opportunity to reduce the code size is combining the scalar version
> from loop versioning with the prolog and the epilog of loop peeling. I
> manually wrote the following function for foo(). The running time does not
> change (for corei7, since I use _mm_loadu_ps()), but the text size (for the
> function only) is reduced from 342 to 240 bytes (41 bytes for the
> non-vectorized version). We can get more benefit if the loop body is larger.
>
>
> void foo2 (TYPE *a, TYPE *b, TYPE *c, int n)
> {
>   int i, m;
>   __m128 veca, vecb, vecc;
>
>   i = 0;
>
>   /* Loop versioning: vectorize only if a cannot overlap b or c.  */
>   if ((b >= a+4 | b+4 <= a) &
>       (c >= a+4 | c+4 <= a))
>     {
>       /* Number of scalar iterations needed to align a to 16 bytes.  */
>       m = (int)((16 - ((unsigned long)a & 15)) >> 2) & 3;
>       if (m > n)
>         m = n;
>       goto L2;
>
> L1:
>       for (; i + 4 <= n; i += 4)
>         {
>           vecb = _mm_loadu_ps (b+i);
>           vecc = _mm_loadu_ps (c+i);
>           veca = _mm_mul_ps (vecb, vecc);
>           _mm_store_ps (a+i, veca);
>         }
>       m = n;  /* The epilog is the scalar loop again.  */
>     }
>   else
>     m = n;    /* Possible aliasing: the scalar loop does everything.  */
>
> L2:
>   for (; i < m; i++)
>     a[i] = b[i] * c[i];
>   if (i < n)
>     goto L1;
> }
>
>
> thanks,
>
> Cong
>
>
> On Wed, Aug 21, 2013 at 11:50 PM, Xinliang David Li <davi...@google.com>
> wrote:
>>
>> > The effect on runtime is not correlated to
>> > either (which means the vectorizer cost model is rather bad), but
>> > integer code usually does not benefit at all.
>>
>> The cost model does need some tuning. For instance, the GCC vectorizer
>> does peeling aggressively, but in many cases peeling can be avoided
>> while still getting good performance -- even when the target does not
>> have efficient unaligned loads/stores to implement unaligned accesses.
>> GCC reports too high a cost for unaligned accesses and too low a cost
>> for the peeling overhead.
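Cong's combined scheme is easier to study without intrinsics. The sketch below is an editor's illustration, not code from the mails: `foo2_model`, the label names, and the peel-count arithmetic are all invented here, and the 4-wide "vector" body is plain scalar code so the example runs on any target. It keeps the same control-flow idea: a single scalar loop serves as the versioning fallback, the alignment prolog, and the epilog.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the combined versioning/prolog/epilog control flow.
   One scalar loop ("scalar:") plays three roles; the 4-wide block
   ("vector:") stands in for the SIMD body.  */
static void
foo2_model (float *a, const float *b, const float *c, int n)
{
  int i = 0, m;

  /* Versioning test: take the wide path only if a cannot overlap b or c
     within one 4-element block (mirrors the check in foo2).  */
  if ((b >= a + 4 || b + 4 <= a) && (c >= a + 4 || c + 4 <= a))
    {
      /* Prolog count: scalar iterations until a is 16-byte aligned
         (editor's formula; assumes a is at least 4-byte aligned).  */
      m = (int) ((16 - ((uintptr_t) a & 15)) >> 2) & 3;
      if (m > n)
        m = n;
      goto scalar;

    vector:
      for (; i + 4 <= n; i += 4)
        {
          a[i]     = b[i]     * c[i];
          a[i + 1] = b[i + 1] * c[i + 1];
          a[i + 2] = b[i + 2] * c[i + 2];
          a[i + 3] = b[i + 3] * c[i + 3];
        }
      m = n;                 /* Epilog: reuse the scalar loop for the tail.  */
    }
  else
    m = n;                   /* Possible aliasing: scalar loop does it all.  */

scalar:
  for (; i < m; i++)
    a[i] = b[i] * c[i];
  if (i < n)
    goto vector;
}
```

The point of the shape is that the fallback, prolog, and epilog paths share one loop body, which is where the text-size saving in Cong's measurement comes from.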
>>
>> Example:
>>
>> #ifndef TYPE
>> #define TYPE float
>> #endif
>> #include <stdlib.h>
>>
>> __attribute__((noinline)) void
>> foo (TYPE *a, TYPE *b, TYPE *c, int n)
>> {
>>   int i;
>>   for (i = 0; i < n; i++)
>>     a[i] = b[i] * c[i];
>> }
>>
>> int g;
>>
>> int
>> main ()
>> {
>>   int i;
>>   float *a = (float *) malloc (100000 * 4);
>>   float *b = (float *) malloc (100000 * 4);
>>   float *c = (float *) malloc (100000 * 4);
>>
>>   for (i = 0; i < 100000; i++)
>>     foo (a, b, c, 100000);
>>
>>   g = a[10];
>> }
>>
>> 1) By default, GCC's vectorizer peels the loop in foo so that the access
>>    to 'a' is aligned and uses movaps; the other accesses use movups when
>>    -march=corei7 is used.
>> 2) Same as above, but with -march=x86-64: the access to 'b' is split
>>    into movlps and movhps, and likewise for 'c'.
>> 3) Disabling peeling (via a hack) with -march=corei7: all three accesses
>>    use movups.
>> 4) Disabling peeling with -march=x86-64: all three accesses use
>>    movlps/movhps.
>>
>> Performance:
>>
>> 1) and 3) -- both 1.58s, but 3) is much smaller than 1): 3)'s text is
>> 1462 bytes while 1)'s is 1622 bytes.
>> 2) and 4) and no vectorization -- all very slow -- 4.8s.
>>
>> Observations:
>> a) If properly tuned for corei7, GCC should pick 3) instead of 1) --
>>    this is not possible today.
>> b) With -march=x86-64, GCC should figure out that the benefit of
>>    vectorizing this loop is small and bail out.
>>
>> >> On the other hand, a 10% compile-time increase due to one pass sounds
>> >> excessive -- there might be some low-hanging fruit to reduce the
>> >> compile-time increase.
>> >
>> > I have already spent two man-months speeding up the vectorizer itself;
>> > I don't think there is any low-hanging fruit left there. But see above
>> > -- most of the compile time is due to the cost of processing the extra
>> > loop copies.
>>
>> Ok.
>>
>> I did not notice your patch (from May this year) until recently.
>> Do you plan to check it in (other than the part that turns it on at
>> O2)? The cost model part of the changes is largely independent. If it
>> is in, it will serve as a good basis for further tuning.
>>
>> >> At the full feature set, vectorization regresses the runtime of
>> >> quite a number of benchmarks significantly. At a reduced feature set
>> >> -- basically trying to vectorize only obviously profitable cases --
>> >> these regressions can be avoided, but progressions remain on only
>> >> two SPEC fp cases. As most user applications fall into the SPEC int
>> >> category, a 10% compile-time and 15% code-size regression for no
>> >> gain is no good.
>> >>
>> >> Cong's data (especially corei7 and corei7avx) shows more significant
>> >> performance improvements. If the 10% compile-time increase is across
>> >> the board and happens on benchmarks with no performance improvement,
>> >> it is certainly bad -- but I am not sure that is the case.
>> >
>> > Note that we are talking about -O2 -- people that enable
>> > -march=corei7 usually know to use -O3 or FDO anyway.
>>
>> Many people use FDO, but not all -- there are still some barriers to
>> adoption. And there are reasons people may not want to use O3:
>> 1) People feel most comfortable using O2 because it is considered the
>>    most thoroughly tested optimization level; going with the default is
>>    the natural choice. FDO is a different beast, as the performance
>>    benefit can be too high to resist.
>> 2) In a distributed build environment with object-file caching/sharing,
>>    building with O3 (different from the default) leads to longer build
>>    times.
>> 3) The size/compile-time cost can be too high with O3. On the other
>>    hand, the benefit of the vectorizer can be very high for many types
>>    of applications such as image processing, stitching, image
>>    detection, DSP, and encoders/decoders -- not only numerical Fortran
>>    programs.
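For reference, option 3) from the measurements above -- no peeling, every access through unaligned loads/stores -- can be written by hand with intrinsics. This is an editor's sketch, not code from the mails: `foo_unaligned` is an invented name, and the `__SSE__` guard with a scalar fallback is added so the example also builds where SSE is unavailable.

```c
#include <assert.h>
#ifdef __SSE__
#include <xmmintrin.h>
#endif

/* Hand-written equivalent of "no peeling, all accesses unaligned":
   on SSE targets every access compiles to movups-style unaligned
   loads/stores; the trailing scalar loop handles the remainder.  */
static void
foo_unaligned (float *a, const float *b, const float *c, int n)
{
  int i = 0;
#ifdef __SSE__
  for (; i + 4 <= n; i += 4)
    {
      __m128 vb = _mm_loadu_ps (b + i);            /* unaligned load  */
      __m128 vc = _mm_loadu_ps (c + i);            /* unaligned load  */
      _mm_storeu_ps (a + i, _mm_mul_ps (vb, vc));  /* unaligned store */
    }
#endif
  for (; i < n; i++)    /* epilog (or the whole loop without SSE) */
    a[i] = b[i] * c[i];
}
```

Because there is no prolog and no aligned/unaligned split, this form needs no loop versioning or peeling at all, which is why its text size is so much smaller than the peeled version 1).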
>>
>>
>> > That said, I expect 99% of used software
>> > (probably rather 99.99999%) is not compiled on the system it runs on,
>> > but compiled to run on generic hardware, and thus restricts itself to
>> > bare x86_64 SSE2 features. So what matters for enabling the
>> > vectorizer at -O2 is the default architecture features of the given
>> > architecture(!) -- and remember to not only consider x86 here!
>> >
>> >> A couple of points I'd like to make:
>> >>
>> >> 1) The loop vectorizer passes the quality threshold to be turned on
>> >>    by default at O2 in 4.9; it is already turned on for FDO at O2.
>> >
>> > With FDO we have a _much_ better way of reasoning about which loops
>> > we spend the compile time and code size on! Exactly the problem that
>> > exists without FDO at -O2 (and also at -O3, but -O3 is not claimed to
>> > be well balanced with regard to compile time and code size).
>> >
>> >> 2) There is still a lot of room for improvement in the loop
>> >>    vectorizer -- there is no doubt about it, and we will need to
>> >>    continue improving it.
>> >
>> > I believe we have to do that first. See the patches regarding the
>> > cost model reorganization that I posted with the proposal for
>> > enabling vectorization at -O2. One large source of collateral damage
>> > from vectorization is if-conversion, which aggressively if-converts
>> > loops regardless of whether we later vectorize the result. The
>> > if-conversion pass needs to be integrated with vectorization.
>>
>> We have noticed some small performance problems with tree if-conversion
>> when it is turned on with FDO -- because that pass does not have a cost
>> model (such as looking at branch probabilities, as the RTL-level if-cvt
>> does). What other problems do you see? Is it just a compile-time
>> concern?
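The if-conversion point is easiest to see on a concrete loop. The sketch below is an editor's illustration (the function names are made up): the first version carries a control dependence in the loop body, which the loop vectorizer cannot handle directly; the second is the select form that if-conversion produces, turning the control dependence into a data dependence that maps onto a vector compare-and-blend. The cost concern is that GCC if-converts loops like this even when the result is never vectorized.

```c
#include <assert.h>

/* Before if-conversion: a branch inside the loop body.  */
static void
clamp_branchy (float *a, const float *b, int n)
{
  for (int i = 0; i < n; i++)
    {
      if (b[i] > 0.0f)   /* control dependence blocks the vectorizer */
        a[i] = b[i];
      else
        a[i] = 0.0f;
    }
}

/* After if-conversion, expressed in source form: the same computation
   as a select, vectorizable as compare + blend.  */
static void
clamp_select (float *a, const float *b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = b[i] > 0.0f ? b[i] : 0.0f;
}
```

Both functions compute the same result; only the shape of the dependence differs.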
>>
>> >
>> >> 3) The only fast way to improve a feature is to get it used widely,
>> >>    so that people can file bugs and report problems -- it is hard
>> >>    for developers to find and collect all the cases where GCC is
>> >>    weak without the GCC community's help. There might be a temporary
>> >>    regression for some users, but it is worth the pain.
>> >
>> > Well, introducing known regressions at -O2 is not how this works.
>> > Vectorization is already widely tested, and you can look at a
>> > plethora of bug reports about missed features and vectorizer
>> > wrong-doings to improve it.
>> >
>> >> 4) Not the most important point, but a practical concern: without
>> >>    turning it on, GCC will be at a great disadvantage when people
>> >>    start benchmarking the latest GCC against other compilers.
>> >
>> > The same argument was made about the fact that GCC does not optimize
>> > by default but uses -O0. It's a straw-man argument. All
>> > "benchmarking" I see uses -O3 or -Ofast already.
>>
>> People can just do an -O2 performance comparison.
>>
>> thanks,
>>
>> David
>>
>> > To make vectorization have a bigger impact on day-to-day software,
>> > GCC would need to start versioning for the target sub-architecture --
>> > which of course increases the issues with code size and compile time.
>> >
>> > Richard.
>>
>> >>>>>>> Vectorization has great performance potential -- the more
>> >>>>>>> people use it, the more likely it will be further improved --
>> >>>>>>> turning it on at O2 is the way to go ...
>> >>>>>>>
>> >>>>>>> Thank you!
>> >>>>>>>
>> >>>>>>> Cong Hou