On Mon, Nov 16, 2020 at 10:58 AM Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> Richard Biener <richard.guent...@gmail.com> writes:
> > On Fri, Nov 13, 2020 at 7:35 PM Richard Sandiford via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> Currently we have three vector cost models: cheap, dynamic and
> >> unlimited.  -O2 -ftree-vectorize uses “cheap” by default, but that's
> >> still relatively aggressive about peeling and aliasing checks,
> >> and can lead to significant code size growth.
> >>
> >> This patch adds an even more conservative choice, which for lack of
> >> imagination I've called “very cheap”.  It only allows vectorisation
> >> if the vector code entirely replaces the scalar code.  It also
> >> requires one iteration of the vector loop to pay for itself,
> >> regardless of how often the loop iterates.  (If the vector loop
> >> needs multiple iterations to be beneficial then things are
> >> probably too close to call, and the conservative thing would
> >> be to stick with the scalar code.)
> >>
> >> The idea is that this should be suitable for -O2, although the patch
> >> doesn't change any defaults itself.
> >>
> >> I tested this by building and running a bunch of workloads for SVE,
> >> with three options:
> >>
> >>   (1) -O2
> >>   (2) -O2 -ftree-vectorize -fvect-cost-model=very-cheap
> >>   (3) -O2 -ftree-vectorize [-fvect-cost-model=cheap]
> >>
> >> All three builds used the default -msve-vector-bits=scalable and
> >> ran with the minimum vector length of 128 bits, which should give
> >> a worst-case bound for the performance impact.
> >>
> >> The workloads included a mixture of microbenchmarks and full
> >> applications.  Because it's quite an eclectic mix, there's not
> >> much point giving exact figures.  The aim was more to get a general
> >> impression.
> >>
> >> Code size growth with (2) was much lower than with (3).  Only a
> >> handful of tests increased by more than 5%, and all of them were
> >> microbenchmarks.
> >>
> >> In terms of performance, (2) was significantly faster than (1)
> >> on microbenchmarks (as expected) but also on some full apps.
> >> Again, performance only regressed on a handful of tests.
> >>
> >> As expected, the performance of (3) vs. (1) and (3) vs. (2) is more
> >> of a mixed bag.  There are several significant improvements with (3)
> >> over (2), but also some (smaller) regressions.  That seems to be in
> >> line with -O2 -ftree-vectorize being a kind of -O2.5.
> >
> > So previous attempts at enabling vectorization at -O2 also factored
> > in compile-time requirements.  We've looked mainly at SPEC, and
> > there even the current "cheap" model doesn't fare very well IIRC,
> > costing quite a lot of compile time and code size.
>
> Yeah, that seems to match what I was seeing with the cheap model:
> the size could increase quite significantly.
>
> > Turning down vectorization even more will have even less impact on
> > performance but the compile-time cost will likely not shrink very
> > much.
>
> Agreed.  We've already done most of the work by the time we decide not
> to go ahead.
>
> I didn't really measure compile time TBH.  This was mostly written
> from an SVE point of view: when SVE is enabled, vectorisation is
> important enough that it's IMO worth paying the compile-time cost.
>
> > I think we need ways to detect candidates that will end up
> > cheap or very cheap without actually doing all of the analysis
> > first.
>
> Yeah, that sounds good if it's doable.  But with SVE, the aim
> is to reduce the number of cases in which a loop would fail to
> be vectorised on cost grounds.  I hope we'll be able to do more
> of that for GCC 12.
>
> E.g. one of the uses of the SVE2 WHILERW and WHILEWR instructions
> is to clamp the amount of work that the vector loop does based on
> runtime aliases.  We don't yet use it for that (it's still on
> the TODO list), but once we do, runtime aliases would often not
> be a problem even for the very cheap model.  And SVE already removes
> two of the other main reasons for aborting early: the need to peel
> for alignment and the need to peel for niters.
>
> There are cases like peeling for gaps that should produce scalar code
> even with SVE, but they probably aren't common enough to have a
> significant impact on compile time.
>
> So in a sense, the aim with SVE is to make that kind of early-out
> test as redundant as possible.
>
> >> The patch reorders vect_cost_model so that values are in order
> >> of increasing aggressiveness, which makes it possible to use
> >> range checks.  The value 0 still represents “unlimited”,
> >> so “if (flag_vect_cost_model)” is still a meaningful check.
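The reordered enum might look roughly like this (a sketch based on the description above rather than the exact source): the models get negative values in increasing order of aggressiveness, leaving 0 free for "unlimited":

```c
#include <assert.h>

/* Sketch of the reordering described above: cost models are numbered
   in increasing order of aggressiveness, with 0 kept for "unlimited"
   so that "if (flag_vect_cost_model)" still tests for the presence of
   a cost model.  Range checks such as
   "model >= VECT_COST_MODEL_CHEAP" then work as expected.  */
enum vect_cost_model
{
  VECT_COST_MODEL_VERY_CHEAP = -3,
  VECT_COST_MODEL_CHEAP = -2,
  VECT_COST_MODEL_DYNAMIC = -1,
  VECT_COST_MODEL_UNLIMITED = 0
};
```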
> >>
> >> Tested on aarch64-linux-gnu, arm-linux-gnueabihf and
> >> x86_64-linux-gnu.  OK to install?
> >
> > Does the patch also vectorize with SVE loops that have
> > unknown loop bound?  The documentation isn't entirely
> > conclusive there.
>
> Yeah, for SVE it vectorises.  How about changing:
>
>   For example, if each iteration of a vectorized loop would handle
>   exactly four iterations, …
>
> to:
>
>   For example, if each iteration of a vectorized loop could only
>   handle exactly four iterations of the original scalar loop, …
>
> ?

Yeah, guess that's better.

>
> > Iff the iteration count is a multiple of two and the target can
> > vectorize the loop with both VF 2 and VF 4 but VF 4 would be better if
> > we'd use the 'cheap' cost model, does 'very-cheap' not vectorize the
> > loop or does it choose VF 2?
>
> It would choose VF 2, if that's still a win over scalar code.

OK, that's what I expected.  The VF iteration is one source of
compile time that we might want to avoid somehow ... on x86_64,
knowing the precise constant iteration count should allow us to
try only a subset of vector modes, based on largest_pow2_factor
or so?  Or maybe just use the preferred SIMD mode for
cheap/very-cheap?  (Maybe pass the cost model kind down to the
target hook so targets can decide for themselves here.)
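A sketch of that idea (these helpers are hypothetical, for illustration only, though `niters & -niters` is the usual bit trick):

```c
#include <assert.h>

/* Sketch of the idea above: when the iteration count is a
   compile-time constant, its largest power-of-2 factor bounds which
   vectorization factors can divide the trip count, so the VF
   iteration could skip vector modes that would always leave a scalar
   epilogue behind.  */
static unsigned
largest_pow2_factor (unsigned niters)
{
  return niters & -niters;  /* isolate the lowest set bit: 12 -> 4 */
}

static int
vf_worth_trying (unsigned vf, unsigned const_niters)
{
  /* A power-of-2 VF divides const_niters iff it is no larger than
     the largest power-of-2 factor of const_niters.  */
  return vf <= largest_pow2_factor (const_niters);
}
```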

> > In itself the patch is reasonable, thus OK.
>
> Thanks.
>
> Richard
