On Tue, Aug 20, 2013 at 3:59 AM, Richard Biener
<richard.guent...@gmail.com> wrote:
> Xinliang David Li <davi...@google.com> wrote:
>>On Mon, Aug 19, 2013 at 11:53 AM, Richard Biener
>><richard.guent...@gmail.com> wrote:
>>> Xinliang David Li <davi...@google.com> wrote:
>>>>+cc auto-vectorizer maintainers.
>>>>
>>>>David
>>>>
>>>>On Mon, Aug 19, 2013 at 10:37 AM, Cong Hou <co...@google.com> wrote:
>>>>> Nowadays, SIMD instructions play more and more important roles in our
>>>>> daily computations. AVX and AVX2 have extended 128-bit registers to
>>>>> 256-bit ones, and the newly announced AVX-512 further doubles the
>>>>> size. The benefit we can get from vectorization will keep growing.
>>>>> This is also common practice in other compilers:
>>>>>
>>>>> 1) Intel's ICC turns on vectorizer at O2 by default and it has been
>>>>> the case for many years;
>>>>>
>>>>> 2) Most recently, LLVM turns it on for both O2 and Os.
>>>>>
>>>>>
>>>>> Here we propose moving vectorization from -O3 to -O2 in GCC. Three
>>>>> main concerns about this change are: 1. Does vectorization greatly
>>>>> increase the generated code size? 2. How much can performance be
>>>>> improved? 3. Does vectorization increase compile time significantly?
>>>>>
>>>>>
>>>>> I have fixed the GCC bootstrap failure with the vectorizer turned on
>>>>> (http://gcc.gnu.org/ml/gcc-patches/2013-07/msg00497.html). To evaluate
>>>>> the size and performance impact, experiments on SPEC06 and internal
>>>>> benchmarks were done. Based on the data, I have tuned the parameters
>>>>> for the vectorizer, which reduces the code bloat without sacrificing
>>>>> the performance gain. There are some performance regressions in
>>>>> SPEC06, and the root causes have been analyzed and understood. I will
>>>>> file bugs tracking them independently. The experiments failed on three
>>>>> benchmarks (please refer to
>>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56993). The experiment
>>>>> results are attached here as two PDF files. Below are our summaries
>>>>> of the results:
>>>>>
>>>>>
>>>>> 1) We noticed that vectorization could increase the generated code
>>>>> size, so we tried to suppress this problem with some tunings, which
>>>>> include setting a higher loop-bound threshold so that loops with few
>>>>> iterations won't be vectorized, and disabling loop versioning. The
>>>>> average size increase dropped from 9.84% to 7.08% after our tunings
>>>>> (13.93% to 10.75% for Fortran benchmarks, and 3.55% to 1.44% for
>>>>> C/C++ benchmarks). The code size increase for individual Fortran
>>>>> benchmarks can still be significant (from 18.72% to 34.15%), but the
>>>>> performance gain is also huge. Hence we think this size increase is
>>>>> reasonable. For C/C++ benchmarks, the size increase is very small
>>>>> (below 3% except for 447.dealII).
>>>>>
>>>>>
>>>>> 2) Vectorization improves the performance of most benchmarks by
>>>>> around 2.5%-3% on average, and much more for Fortran benchmarks. On
>>>>> Sandy Bridge machines, the improvement can be larger when using
>>>>> -march=corei7 (3.27% on average) or -march=corei7-avx (4.81% on
>>>>> average) (please see the attachment for details). We also noticed
>>>>> that some performance degradations exist, and after investigation we
>>>>> found that some are caused by defects in GCC's vectorization (e.g.
>>>>> GCC's SLP cannot vectorize a group of accesses if the group size
>>>>> cannot be divided by the VF,
>>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955, and any data
>>>>> dependence between statements can prevent vectorization), which can
>>>>> be resolved in the future.
>>>>>
>>>>>
>>>>> 3) Lastly, we found that introducing vectorization has almost no
>>>>> effect on build time. The GCC bootstrap time increase is negligible.
>>>>>
>>>>>
>>>>> As a reference, Richard Biener is also proposing to move
>>>>> vectorization to O2 by improving the cost model
>>>>> (http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00904.html).
>>>
>>> And my conclusion is that we are not ready for this.  The compile-time
>>> cost does not outweigh the benefit.
>>
>>Can you elaborate on your reasoning?
>
> I have done measurements with SPEC 2006, selectively turning on parts of
> the vectorizer at O2.  Vectorizing has both a compile-time (around 10%)
> and a code-size (up to 15%) impact.

Cong has only done compile-time measurements with GCC bootstrap so far --
and the impact is very small. He can confirm the compile-time impact with
his tunings applied.

From Cong's data, the benchmarks with large size increases also come with
huge performance improvements:

o cactusADM -- size increases 18.7%, performance improves 37.5%
o leslie3d -- size increases 34.15%, performance improves 29.4%
.. etc.

The CPU savings here are much larger than the cost of the small RAM
increase from the larger text size.  Applications that really care about
size should use Os anyway.

For the compile-time increase, do you see a similar pattern -- i.e., does
a large compile-time increase correspond to a large performance
improvement?

On the other hand, a 10% compile-time increase due to one pass sounds
excessive -- there might be some low-hanging fruit for reducing it.

> At the full feature set, vectorization significantly regresses the
> runtime of quite a number of benchmarks.  At a reduced feature set -
> basically trying to vectorize only obviously profitable cases - these
> regressions can be avoided, but progressions remain on only two SPEC fp
> cases.  As most user applications fall into the SPEC int category, a 10%
> compile-time and 15% code-size regression for no gain is no good.
>

Cong's data (especially for corei7 and corei7-avx) show a more significant
performance improvement.   If the 10% compile-time increase is across the
board and happens on benchmarks with no performance improvement, that is
certainly bad - but I am not sure that is the case.

A couple of points I'd like to make:

1) The loop vectorizer passes the quality threshold to be turned on by
default at O2 in 4.9; it is already turned on for FDO at O2.
2) There is still lots of room for improvement in the loop vectorizer --
there is no doubt about it, and we will need to continue improving it.
3) The only fast way to improve a feature is to get it used widely so
that people can file bugs and report problems -- it is hard for
developers to find and collect all the cases where GCC is weak without
the GCC community's help. There might be temporary regressions for some
users, but it is worth the pain.
4) Not the most important point, but a practical concern: without
turning it on, GCC will be at a great disadvantage when people start
benchmarking the latest GCC against other compilers.

thanks,

David



> Richard.
>
>>thanks,
>>
>>David
>>
>>
>>>
>>> Richard.
>>>
>>>>>
>>>>> Vectorization has great performance potential -- the more people use
>>>>> it, the more likely it is to be further improved -- turning it on at
>>>>> O2 is the way to go ...
>>>>>
>>>>>
>>>>> Thank you!
>>>>>
>>>>>
>>>>> Cong Hou
>>>
>>>
>
>
