On Mon, Nov 16, 2020 at 10:58 AM Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> Richard Biener <richard.guent...@gmail.com> writes:
> > On Fri, Nov 13, 2020 at 7:35 PM Richard Sandiford via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> Currently we have three vector cost models: cheap, dynamic and
> >> unlimited.  -O2 -ftree-vectorize uses “cheap” by default, but that's
> >> still relatively aggressive about peeling and aliasing checks,
> >> and can lead to significant code size growth.
> >>
> >> This patch adds an even more conservative choice, which for lack of
> >> imagination I've called “very cheap”.  It only allows vectorisation
> >> if the vector code entirely replaces the scalar code.  It also
> >> requires one iteration of the vector loop to pay for itself,
> >> regardless of how often the loop iterates.  (If the vector loop
> >> needs multiple iterations to be beneficial then things are
> >> probably too close to call, and the conservative thing would
> >> be to stick with the scalar code.)
> >>
> >> The idea is that this should be suitable for -O2, although the patch
> >> doesn't change any defaults itself.
> >>
> >> I tested this by building and running a bunch of workloads for SVE,
> >> with three options:
> >>
> >>   (1) -O2
> >>   (2) -O2 -ftree-vectorize -fvect-cost-model=very-cheap
> >>   (3) -O2 -ftree-vectorize [-fvect-cost-model=cheap]
> >>
> >> All three builds used the default -msve-vector-bits=scalable and
> >> ran with the minimum vector length of 128 bits, which should give
> >> a worst-case bound for the performance impact.
> >>
> >> The workloads included a mixture of microbenchmarks and full
> >> applications.  Because it's quite an eclectic mix, there's not
> >> much point giving exact figures.  The aim was more to get a general
> >> impression.
> >>
> >> Code size growth with (2) was much lower than with (3).  Only a
> >> handful of tests increased by more than 5%, and all of them were
> >> microbenchmarks.
> >>
> >> In terms of performance, (2) was significantly faster than (1)
> >> on microbenchmarks (as expected) but also on some full apps.
> >> Again, performance only regressed on a handful of tests.
> >>
> >> As expected, the performance of (3) vs. (1) and (3) vs. (2) is more
> >> of a mixed bag.  There are several significant improvements with (3)
> >> over (2), but also some (smaller) regressions.  That seems to be in
> >> line with -O2 -ftree-vectorize being a kind of -O2.5.
> >
> > So previous attempts at enabling vectorization at -O2 also factored
> > in compile-time requirements.  We've looked mainly at SPEC and
> > there even the current "cheap" model doesn't fare very well IIRC
> > and costs quite some compile-time and code-size.
>
> Yeah, that seems to match what I was seeing with the cheap model:
> the size could increase quite significantly.
>
> > Turning down vectorization even more will have even less impact on
> > performance but the compile-time cost will likely not shrink very
> > much.
>
> Agreed.  We've already done most of the work by the time we decide not
> to go ahead.
>
> I didn't really measure compile time TBH.  This was mostly written
> from an SVE point of view: when SVE is enabled, vectorisation is
> important enough that it's IMO worth paying the compile-time cost.
>
> > I think we need ways to detect candidates that will end up
> > cheap or very cheap without actually doing all of the analysis
> > first.
>
> Yeah, that sounds good if it's doable.  But with SVE, the aim
> is to reduce the number of cases in which a loop would fail to
> be vectorised on cost grounds.  I hope we'll be able to do more
> of that for GCC 12.
>
> E.g. one of the uses of the SVE2 WHILERW and WHILEWR instructions
> is to clamp the amount of work that the vector loop does based on
> runtime aliases.  We don't yet use it for that (it's still on
> the TODO list), but once we do, runtime aliases would often not
> be a problem even for the very cheap model.
> And SVE already removes
> two of the other main reasons for aborting early: the need to peel
> for alignment and the need to peel for niters.
>
> There are cases like peeling for gaps that should produce scalar code
> even with SVE, but they probably aren't common enough to have a
> significant impact on compile time.
>
> So in a sense, the aim with SVE is to make that kind of early-out test
> redundant as much as possible.
>
> >> The patch reorders vect_cost_model so that values are in order
> >> of increasing aggressiveness, which makes it possible to use
> >> range checks.  The value 0 still represents “unlimited”,
> >> so “if (flag_vect_cost_model)” is still a meaningful check.
> >>
> >> Tested on aarch64-linux-gnu, arm-linux-gnueabihf and
> >> x86_64-linux-gnu.  OK to install?
> >
> > Does the patch also vectorize with SVE loops that have
> > unknown loop bound?  The documentation isn't entirely
> > conclusive there.
>
> Yeah, for SVE it vectorises.  How about changing:
>
>   For example, if each iteration of a vectorized loop would handle
>   exactly four iterations, …
>
> to:
>
>   For example, if each iteration of a vectorized loop could only
>   handle exactly four iterations of the original scalar loop, …
>
> ?
Yeah, guess that's better.

> > Iff the iteration count is a multiple of two and the target can
> > vectorize the loop with both VF 2 and VF 4 but VF 4 would be better if
> > we'd use the 'cheap' cost model, does 'very-cheap' not vectorize the
> > loop or does it choose VF 2?
>
> It would choose VF 2, if that's still a win over scalar code.

OK, that's what I expected.  The VF iteration is one source of
compile-time that we might want to avoid somehow ... on x86_64
knowing the precise number of constant iterations should allow
to only pick a subset of vector modes based on largest_pow2_factor
or so?  Or maybe just use the preferred SIMD mode for
cheap/very-cheap?  (maybe pass down the cost model kind to the
target hook so targets can decide for themselves here)

> > In itself the patch is reasonable, thus OK.
>
> Thanks.
>
> Richard