> > I tested this by building and running a bunch of workloads for SVE,
> > with three options:
> >
> >   (1) -O2
> >   (2) -O2 -ftree-vectorize -fvect-cost-model=very-cheap
> >   (3) -O2 -ftree-vectorize [-fvect-cost-model=cheap]
> >
> > All three builds used the default -msve-vector-bits=scalable and
> > ran with the minimum vector length of 128 bits, which should give
> > a worst-case bound for the performance impact.
> >
> > The workloads included a mixture of microbenchmarks and full
> > applications.  Because it's quite an eclectic mix, there's not
> > much point giving exact figures.  The aim was more to get a general
> > impression.
> >
> > Code size growth with (2) was much lower than with (3).  Only a
> > handful of tests increased by more than 5%, and all of them were
> > microbenchmarks.
> >
> > In terms of performance, (2) was significantly faster than (1)
> > on microbenchmarks (as expected) but also on some full apps.
> > Again, performance only regressed on a handful of tests.
> >
> > As expected, the performance of (3) vs. (1) and (3) vs. (2) is more
> > of a mixed bag.  There are several significant improvements with (3)
> > over (2), but also some (smaller) regressions.  That seems to be in
> > line with -O2 -ftree-vectorize being a kind of -O2.5.
> 
> So previous attempts at enabling vectorization at -O2 also factored
> in compile-time requirements.  We've looked mainly at SPEC and
> there even the current "cheap" model doesn't fare very well IIRC
> and costs quite some compile-time and code-size.  Turning down
> vectorization even more will have even less impact on performance
> but the compile-time cost will likely not shrink very much.
> 
> I think we need ways to detect candidates that will end up
> cheap or very cheap without actually doing all of the analysis
> first.
The current cheap model indeed costs quite a bit of code size.  I was
playing with a similar patch (mine simply changed the cheap model).
Richard's patch tests as follows:

https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e4360e452b4c6cd56d4e21663703e920763413f5%2C11d39d4b278efb4a0134f495d698c2cd764c06e4%2Ca0917becd66182eee6eb6a7a150e38f0463b765d&include_user_branches=on
https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e4360e452b4c6cd56d4e21663703e920763413f5%2C11d39d4b278efb4a0134f495d698c2cd764c06e4%2Ca0917becd66182eee6eb6a7a150e38f0463b765d&include_user_branches=on
(not all of the SPEC2k runs were finished at the time of writing this email)

Here the baseline is current trunk, the first run is with the vectorizer
forced to very cheap for both -O2 and -O3/-Ofast, and the last is with the
vectorizer forced to dynamic (so -O3/-Ofast is the same as the baseline).
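
In terms of flags, the two non-baseline runs correspond roughly to
something like the following (the actual runs just force the cost model
defaults inside the compiler rather than passing flags):

# first run: vectorization with the very-cheap model at all levels
gcc -O2 -ftree-vectorize -fvect-cost-model=very-cheap ...
gcc -O3 -fvect-cost-model=very-cheap ...

# last run: the dynamic model everywhere, so -O3/-Ofast matches trunk
gcc -O2 -ftree-vectorize -fvect-cost-model=dynamic ...
gcc -O3 -fvect-cost-model=dynamic ...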

The 6.5% SPECint2017 improvement at -O2 is certainly very nice even if
a large part of it comes from the x264 benchmark.
CPU2006 is affected by regressions in tonto and cactusADM, where the
second is known to be a bit random.

There are some regressions, but they already exist at -O3, and I guess
they are easier to track down than the usual -O3 vectorization failures,
so I will check whether they are tracked in bugzilla.
(For example, the 100% regression on cray should be very easy to track down.)

libxul LTO link time at -O2 goes up from
real    7m47.358s
user    76m49.109s
sys     2m2.403s

to

real    8m12.651s
user    80m0.704s
sys     2m9.275s

so about a 4.1% increase in backend time.  (The overall Firefox build
takes about 45 minutes on my setup.)
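
(For reference, that figure is just the ratio of the user times:
80m0.704s = 4800.7s against 76m49.109s = 4609.1s, and
4800.7 / 4609.1 is about 1.04.)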

For comparison, -O2 --disable-tree-pre gives me:

real    7m36.438s
user    73m20.167s
sys     2m3.460s

So PRE costs about 4.7% of backend time.
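
(Again from the user times: 76m49.109s = 4609.1s with PRE against
73m20.167s = 4400.2s without, and (4609.1 - 4400.2) / 4400.2 is about
0.047.)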
These values should be above the noise level; I re-ran the tests a few times.

I would say that the speedups justify the vectorization, especially
when there is essentially zero code size cost.  It depends on where we
set the bar for compile time...

Honza
