> > I tested this by building and running a bunch of workloads for SVE,
> > with three options:
> >
> >   (1) -O2
> >   (2) -O2 -ftree-vectorize -fvect-cost-model=very-cheap
> >   (3) -O2 -ftree-vectorize [-fvect-cost-model=cheap]
> >
> > All three builds used the default -msve-vector-bits=scalable and
> > ran with the minimum vector length of 128 bits, which should give
> > a worst-case bound for the performance impact.
> >
> > The workloads included a mixture of microbenchmarks and full
> > applications.  Because it's quite an eclectic mix, there's not
> > much point giving exact figures.  The aim was more to get a general
> > impression.
> >
> > Code size growth with (2) was much lower than with (3).  Only a
> > handful of tests increased by more than 5%, and all of them were
> > microbenchmarks.
> >
> > In terms of performance, (2) was significantly faster than (1)
> > on microbenchmarks (as expected) but also on some full apps.
> > Again, performance only regressed on a handful of tests.
> >
> > As expected, the performance of (3) vs. (1) and (3) vs. (2) is more
> > of a mixed bag.  There are several significant improvements with (3)
> > over (2), but also some (smaller) regressions.  That seems to be in
> > line with -O2 -ftree-vectorize being a kind of -O2.5.
>
> So previous attempts at enabling vectorization at -O2 also factored
> in compile-time requirements.  We've looked mainly at SPEC, and
> there even the current "cheap" model doesn't fare very well IIRC
> and costs quite some compile time and code size.  Turning down
> vectorization even more will have even less impact on performance,
> but the compile-time cost will likely not shrink very much.
>
> I think we need ways to detect candidates that will end up
> cheap or very cheap without actually doing all of the analysis
> first.

The current cheap model indeed costs quite a bit of code size.  I was
playing with a similar patch (mine simply changed the cheap model).
Richard's patch tests as follows:
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e4360e452b4c6cd56d4e21663703e920763413f5%2C11d39d4b278efb4a0134f495d698c2cd764c06e4%2Ca0917becd66182eee6eb6a7a150e38f0463b765d&include_user_branches=on

https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e4360e452b4c6cd56d4e21663703e920763413f5%2C11d39d4b278efb4a0134f495d698c2cd764c06e4%2Ca0917becd66182eee6eb6a7a150e38f0463b765d&include_user_branches=on

(Not all of the SPEC2k runs were finished at the time of writing this
email.)

Here the baseline is current trunk, the first run has the vectorizer
forced to very-cheap for both -O2 and -O3/-Ofast, and the last has the
vectorizer forced to dynamic (so -O3/-Ofast is the same as the
baseline).

The 6.5% SPECint2017 improvement at -O2 is certainly very nice, even if
a large part of it comes from the x264 benchmark.  CPU2006 is affected
by regressions in tonto and cactusADM, where the second is known to be
a bit random.  There are some regressions, but they exist already at
-O3, and I guess those are easier to track than the usual -O3
vectorization failures, so I will check whether they are tracked in
bugzilla.  (For example, the 100% regression on cray should be very
easy.)

libxul LTO link time at -O2 goes up from

real    7m47.358s
user    76m49.109s
sys     2m2.403s

to

real    8m12.651s
user    80m0.704s
sys     2m9.275s

so about 4.1% of backend time.  (The overall firefox build time is
about 45 minutes on my setup.)

For comparison, -O2 --disable-tree-pre gives me:

real    7m36.438s
user    73m20.167s
sys     2m3.460s

so 4.7% of backend time.  These values should be outside the noise;
I re-ran the tests a few times.

I would say that the speedups from vectorization are justified,
especially when there are essentially zero code-size costs.  It depends
where we set the bar on compile time...

Honza
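As a sanity check, the backend-time percentages above can be redone from the quoted `time` output (user time only).  The helper below is purely illustrative, with the figures copied verbatim from this email:

```python
def seconds(minutes, secs):
    """Convert a `time`-style reading like 76m49.109s into seconds."""
    return minutes * 60 + secs

# user times of the libxul LTO link, as quoted above
baseline   = seconds(76, 49.109)  # -O2, current trunk
vectorized = seconds(80, 0.704)   # -O2 with very-cheap vectorization
no_pre     = seconds(73, 20.167)  # -O2 --disable-tree-pre

# extra backend time spent on vectorization, relative to the baseline
vect_cost = (vectorized - baseline) / baseline * 100
# backend time spent on PRE, relative to the build without it
pre_cost = (baseline - no_pre) / no_pre * 100

print(f"vectorization: +{vect_cost:.1f}% user time")  # roughly the 4.1% quoted
print(f"tree-pre:      +{pre_cost:.1f}% user time")   # roughly the 4.7% quoted
```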