[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES

alalaw01 at gcc dot gnu.org Tue, 22 Dec 2015 03:20:08 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707


--- Comment #23 from alalaw01 at gcc dot gnu.org ---
Yes, difficult. I'm conscious that this is stage 3, and worried about adding
too much complexity, especially if we're writing code that we'd eventually drop
in favour of a more complete framework later (i.e. in GCC 7).

I'm inclined against

> (I wondered
> if load-lanes would require more unrolling we should prefer SLP anyway?).

as we've seen cases where load-lanes requires more unrolling but the code is
still much better. Likewise, your argument against

> to query whether _all_ loads need to be permuted with SLP
...
> thus if there is a load node which is not permuted then retain the SLP.

seems convincing. I think the heuristic in comment 16 handles permutation well
enough, and beyond that, sharing (rather than the permutation) then appears to
be the critical factor. Unfortunately, as you say, SLP doesn't really handle
sharing yet... so

> I fear that to get a better heuristic
> than what is proposed we need to push this for example to
> vect_make_slp_decision where all instances are built

Might be reasonable, but I fear it'd be of dubious benefit without:

> and we'd need to gather some sharing data therein.

I guess if that were a useful step towards

> But then there is only a small step to the point where we could actually
> compare SLP vs. non-SLP costs.

then there is some justification, but the former feels like too much complexity
at this stage - especially to do it well; how much do we really want to gather
data on the sharing that exists at present, rather than looking at removing
that sharing entirely? I'm thinking of e.g. SLP nodes that are performing the
same computations but with different permutations too - shouldn't we be aiming
at making permutations into first class citizens/operations, and making SLP
trees into DAGs? Longer-term goals, sure...

So my instinct is to go with the comment 16 patch, and accept that we take the
hit in that last testcase (i.e. the one with the sharing).
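
For readers without the testcases to hand, the two situations discussed above
can be sketched roughly as follows. This is a hypothetical illustration, not
the testcase from the bug: the function names and loop bodies are assumptions
chosen only to show the shape of the problem. In swap_pairs every load in the
group is permuted relative to the stores, so SLP would likely need a
VEC_PERM_EXPR, whereas a target with load-lanes support could de-interleave
instead. In sum_and_swapped_sum two store groups consume the same loads, once
in order and once permuted, which is the kind of sharing SLP trees do not yet
model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: a store group of size 2 whose loads are all
   permuted (element 1 feeds lane 0 and vice versa).  */
void swap_pairs(int *restrict out, const int *restrict in, size_t n_pairs)
{
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n_pairs; i++) {
        out[2 * i]     = in[2 * i + 1];
        out[2 * i + 1] = in[2 * i];
    }
}

/* Hypothetical example: two store groups (a and b) share the loads from
   x and y.  The group for a uses them in order; the group for b uses the
   same values with the lanes swapped.  */
void sum_and_swapped_sum(int *restrict a, int *restrict b,
                         const int *restrict x, const int *restrict y,
                         size_t n_pairs)
{
    assert(a != NULL && b != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < n_pairs; i++) {
        a[2 * i]     = x[2 * i]     + y[2 * i];
        a[2 * i + 1] = x[2 * i + 1] + y[2 * i + 1];
        b[2 * i]     = x[2 * i + 1] + y[2 * i + 1];
        b[2 * i + 1] = x[2 * i]     + y[2 * i];
    }
}
```

Which strategy the vectorizer actually picks for loops like these depends on
the target and the GCC version; compiling with -O3 -fdump-tree-vect-details
and inspecting the dump is one way to check.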
