On Thu, Aug 22, 2013 at 8:50 AM, Xinliang David Li <davi...@google.com> wrote:
>> The effect on runtime is not correlated to
>> either (which means the vectorizer cost model is rather bad), but integer
>> code usually does not benefit at all.
>
> The cost model does need some tuning. For instance, the GCC vectorizer
> peels aggressively, but peeling can often be avoided while still
> getting good performance -- even when the target does not have
> efficient unaligned loads/stores to implement unaligned accesses. GCC
> assigns too high a cost to unaligned accesses and too low a cost to
> peeling overhead.
>
> Example:
>
> #ifndef TYPE
> #define TYPE float
> #endif
> #include <stdlib.h>
>
> __attribute__((noinline)) void
> foo (TYPE *a, TYPE* b, TYPE *c, int n)
> {
>    int i;
>    for ( i = 0; i < n; i++)
>      a[i] = b[i] * c[i];
> }
>
> int g;
> int
> main()
> {
>    int i;
>    float *a = (float*) malloc (100000*4);
>    float *b = (float*) malloc (100000*4);
>    float *c = (float*) malloc (100000*4);
>
>    for (i = 0; i < 100000; i++)
>       foo(a, b, c, 100000);
>
>
>    g = a[10];
>
> }
>
>
> 1) By default, GCC's vectorizer will peel the loop in foo so that
> accesses to 'a' are aligned and use the movaps instruction. The other
> accesses use movups when -march=corei7 is given.
> 2) Same as above, but with -march=x86-64: accesses to 'b' are split
> into movlps and movhps, and likewise for 'c'.
>
> 3) Disabling peeling (via a hack) with -march=corei7 -- all three
> accesses use movups.
> 4) Disabling peeling, with -march=x86-64 -- all three accesses
> use movlps/movhps.
>
> Performance:
>
> 1) and 3) -- both 1.58s, but 3) is much smaller than 1): 3)'s text is
> 1462 bytes, while 1)'s is 1622 bytes.
> 2) and 4) and no vectorization -- all very slow: 4.8s.
>
> Observations:
> a) If properly tuned for corei7, GCC should pick 3) instead
> of 1) -- this is not possible today.
> b) With -march=x86-64, GCC should figure out that the benefit of
> vectorizing the loop is small and bail out.
>
>>> On the other hand, 10% compile time increase due to one pass sounds
>>> excessive -- there might be some low hanging fruit to reduce the
>>> compile time increase.
>>
>> I have already spent two man-months speeding up the vectorizer itself;
>> I don't think there is any low-hanging fruit left there.  But see
>> above -- most of the compile-time is due to the cost of processing
>> the extra loop copies.
>>
>
> Ok.
>
> I did not notice your patch (in May this year) until recently. Do you
> plan to check it in (other than the part that turns vectorization on
> at O2)? The cost model part of the changes is largely independent. If
> it is in, it will serve as a good basis for further tuning.
>
>
>>>  At the full feature set, vectorization significantly regresses the
>>> runtime of quite a number of benchmarks. At a reduced feature set --
>>> basically trying to vectorize only obviously profitable cases --
>>> these regressions can be avoided, but improvements remain on only
>>> two SPECfp cases. As most user applications fall into the SPECint
>>> category, a 10% compile-time and 15% code-size regression for no
>>> gain is no good.
>>>
>>> Cong's data (especially corei7 and corei7avx) shows more significant
>>> performance improvement.   If 10% compile time increase is across the
>>> board and happens on benchmarks with no performance improvement, it is
>>> certainly bad - but I am not sure if that is the case.
>>
>> Note that we are talking about -O2 - people that enable -march=corei7 usually
>> know to use -O3 or FDO anyway.
>
> Many people use FDO, but not all -- there are still some barriers to
> adoption. And there are reasons people may not want to use O3:
> 1) People feel most comfortable using O2 because it is considered the
> most thoroughly tested optimization level, and going with the
> default is the natural choice. FDO is a different beast, as its
> performance benefit can be too high to resist;
> 2) In a distributed build environment with object file
> caching/sharing, building with O3 (different from the default) leads
> to longer build time;
> 3) The size/compile-time cost can be too high with O3. On the other
> hand, the benefit of the vectorizer can be very high for many types
> of applications -- image processing, stitching, image detection, DSP,
> encoders/decoders -- not just numerical Fortran programs.
>
>
>> That said, I expect 99% of software in use (probably more like
>> 99.99999%) is not compiled on the system it runs on but is compiled
>> to run on generic hardware, and thus restricts itself to bare x86_64
>> SSE2 features.  So what matters for enabling the vectorizer at -O2
>> is the default architecture features of the given architecture(!) --
>> remember to not only consider x86 here!
>>
>>> A couple of points I'd like to make:
>>>
>>> 1) the loop vectorizer passes the quality threshold to be turned on
>>> by default at O2 in 4.9; it is already on at O2 with FDO.
>>
>> With FDO we have a _much_ better way of reasoning about which loops
>> to spend the compile-time and code-size on!  Exactly the problem that
>> exists without FDO at -O2 (and also at -O3, though -O3 is not claimed
>> to be well-balanced with regard to compile-time and code-size).
>>
>>> 2) there is still a lot of room for improvement in the loop
>>> vectorizer -- no doubt about it, and we will need to continue
>>> improving it;
>>
>> I believe we have to first do that.  See the patches regarding the
>> cost model reorg that I posted with the proposal for enabling
>> vectorization at -O2.
>> One large source of collateral damage from vectorization is
>> if-conversion, which aggressively if-converts loops regardless of
>> whether we later vectorize the result. The if-conversion pass needs
>> to be integrated with vectorization.
>
> We have noticed some small performance problems with tree
> if-conversion when it is turned on with FDO -- because that pass has
> no cost model (unlike RTL-level if-conversion, which looks at branch
> probabilities). What other problems do you see? Is it just a
> compile-time concern?
>
>>
>>> 3) the only fast way to improve a feature is to get it widely used
>>> so that people can file bugs and report problems -- it is hard for
>>> developers to find and collect all the cases where GCC is weak
>>> without the GCC community's help. There might be a temporary
>>> regression for some users, but it is worth the pain.
>>
>> Well, introducing known regressions at -O2 is not how this works.
>> Vectorization is already widely tested, and you can look at a plethora
>> of bug reports about missed features and vectorizer wrongdoings to
>> improve it.
>>
>>> 4) Not the most important one, but a practical concern: without
>>> turning it on, GCC will be at a great disadvantage when people start
>>> benchmarking the latest GCC against other compilers ...
>>
>> The same argument was made about the fact that GCC does not optimize
>> by default but uses -O0.  It's a straw-man argument.  All
>> "benchmarking" I see uses -O3 or -Ofast already.
>
> People can just do -O2 performance comparison.

They can also do -O performance comparison.

Richard.

> thanks,
>
> David
>
>> To make vectorization have a bigger impact on day-to-day software,
>> GCC would need to start versioning for the target sub-architecture --
>> which of course worsens the code-size and compile-time issue.
>>
>> Richard.
>>
>>> thanks,
>>>
>>> David
>>>
>>>
>>>
>>>> Richard.
>>>>
>>>>>thanks,
>>>>>
>>>>>David
>>>>>
>>>>>
>>>>>>
>>>>>> Richard.
>>>>>>
>>>>>>>>
>>>>>>>> Vectorization has great performance potential -- the more
>>>>>>>> people use it, the more likely it will be further improved --
>>>>>>>> turning it on at O2 is the way to go ...
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>>
>>>>>>>>
>>>>>>>> Cong Hou
>>>>>>
>>>>>>
>>>>
>>>>
