On Sat, Nov 21, 2020 at 9:30 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> > > I tested this by building and running a bunch of workloads for SVE,
> > > with three options:
> > >
> > >   (1) -O2
> > >   (2) -O2 -ftree-vectorize -fvect-cost-model=very-cheap
> > >   (3) -O2 -ftree-vectorize [-fvect-cost-model=cheap]
> > >
> > > All three builds used the default -msve-vector-bits=scalable and
> > > ran with the minimum vector length of 128 bits, which should give
> > > a worst-case bound for the performance impact.
> > >
> > > The workloads included a mixture of microbenchmarks and full
> > > applications.  Because it's quite an eclectic mix, there's not
> > > much point giving exact figures.  The aim was more to get a general
> > > impression.
> > >
> > > Code size growth with (2) was much lower than with (3).  Only a
> > > handful of tests increased by more than 5%, and all of them were
> > > microbenchmarks.
> > >
> > > In terms of performance, (2) was significantly faster than (1)
> > > on microbenchmarks (as expected) but also on some full apps.
> > > Again, performance only regressed on a handful of tests.
> > >
> > > As expected, the performance of (3) vs. (1) and (3) vs. (2) is more
> > > of a mixed bag.  There are several significant improvements with (3)
> > > over (2), but also some (smaller) regressions.  That seems to be in
> > > line with -O2 -ftree-vectorize being a kind of -O2.5.
> >
> > So previous attempts at enabling vectorization at -O2 also factored
> > in compile-time requirements.  We've looked mainly at SPEC and
> > there even the current "cheap" model doesn't fare very well IIRC
> > and costs quite some compile-time and code-size.  Turning down
> > vectorization even more will have even less impact on performance
> > but the compile-time cost will likely not shrink very much.
> >
> > I think we need ways to detect candidates that will end up
> > cheap or very cheap without actually doing all of the analysis
> > first.
>
> The current cheap model indeed costs quite some code size.  I
> was playing with a similar patch (mine simply changed the cheap model).
> Richard's patch tests as follows:
>
> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e4360e452b4c6cd56d4e21663703e920763413f5%2C11d39d4b278efb4a0134f495d698c2cd764c06e4%2Ca0917becd66182eee6eb6a7a150e38f0463b765d&include_user_branches=on
> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e4360e452b4c6cd56d4e21663703e920763413f5%2C11d39d4b278efb4a0134f495d698c2cd764c06e4%2Ca0917becd66182eee6eb6a7a150e38f0463b765d&include_user_branches=on
> (not all of the SPEC2k runs are finished at the time of writing this email)
>
> Here the baseline is current trunk, the first run is with the vectorizer
> forced to very cheap for both -O2 and -O3/fast, and the last is with the
> vectorizer forced to dynamic (so -O3/fast is the same as the baseline).
>
> A 6.5% SPECint2017 improvement at -O2 is certainly very nice even if
> a large part comes from the x264 benchmark.
> CPU2006 is affected by regressions of tonto and cactusADM, where the
> second is known to be a bit random.
>
> There are some regressions but they exist already at -O3 and I guess
> those are easier to track than the usual -O3 vectorization failure so I
> will check if they are tracked by bugzilla.
> (for example the 100% regression on cray should be very easy)
>
> libxul LTO linktime at -O2 goes up from
>
> real    7m47.358s
> user    76m49.109s
> sys     2m2.403s
>
> to
>
> real    8m12.651s
> user    80m0.704s
> sys     2m9.275s
>
> so about 4.1% of backend time.  (overall firefox build time is about 45
> minutes on my setup)

Hmm, that's unfortunate.  With very-cheap we should avoid the known
quadraticness (each vect_do_peeling call will do a whole-function SSA
update), which would leave the other one (dependence calculation).
Still, profiling might make some sense here (IIRC SPEC wrf was one of
the worst outliers in my measurements, but that run did not avoid all
peelings).

Richard.

>
> For comparison -O2 --disable-tree-pre gives me:
>
> real    7m36.438s
> user    73m20.167s
> sys     2m3.460s
>
> So 4.7% backend time.
> These values should be off-noise, I re-ran the tests a few times.
>
> I would say that the speedups for vectorization are justified especially
> when there are essentially zero code size costs.  It depends where we
> set the bar on compile time...
>
> Honza
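
For reference, the two percentages quoted above can be reproduced from the
"user" lines of the three timings.  This is only a minimal sketch in Python:
the helper names are made up for illustration, and it assumes (the thread
does not say so explicitly) that "backend time" refers to the user CPU time
of the LTO link.

```python
def to_seconds(stamp: str) -> float:
    """Convert a time(1)-style stamp such as '76m49.109s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def pct_increase(before: str, after: str) -> float:
    """Relative growth from `before` to `after`, in percent."""
    b, a = to_seconds(before), to_seconds(after)
    return (a - b) / b * 100.0


# User times copied from the libxul LTO link measurements in the thread.
USER_O2_NO_PRE = "73m20.167s"   # -O2 --disable-tree-pre
USER_O2 = "76m49.109s"          # plain -O2
USER_O2_VECT = "80m0.704s"      # -O2 with the patch applied

# Assumed reading: each figure is the cost of one pass relative to the
# build without it.
vect_cost = pct_increase(USER_O2, USER_O2_VECT)
pre_cost = pct_increase(USER_O2_NO_PRE, USER_O2)

print(f"vectorizer at -O2: +{vect_cost:.2f}% user time")
print(f"tree-pre at -O2:   +{pre_cost:.2f}% user time")
```

Under that reading the vectorizer adds a little over 4.1% and PRE a little
over 4.7%, so the two costs are in the same ballpark.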