Sorry for the long delay on getting back to this. I took a look at the suggested test cases with the cost model available, and did some SPEC testing to validate the model. I found that it is still important to model the 4xfloat case separately to account for conversion from 64-bit to 32-bit in our internal representation; correcting the calculation there actually results in no net change, but the commentary is now better. The default cost of N-1 that Richard added in his patch applies well to POWER also, except for V2DI and V2DF modes (N=2) where this tends to undercount the cost and encourage unprofitable SLP vectorization in some cases. Establishing a minimum cost of 2 avoids test suite regressions and produces acceptable SPEC results. (So rather than using N instead of N-1 as in the previous version of this patch, I'm using N-1 with a floor of 2.)
I looked through gcc.dg/vect/slp-4[35].c with the cost model enabled, and the results are sensible with these changes. SPEC results were all in the noise range. Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions. Ok for trunk? Thanks, Bill 2016-08-10 Bill Schmidt <wschm...@linux.vnet.ibm.com> * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Correct costs for vec_construct. Index: gcc/config/rs6000/rs6000.c =================================================================== --- gcc/config/rs6000/rs6000.c (revision 239310) +++ gcc/config/rs6000/rs6000.c (working copy) @@ -5266,16 +5266,20 @@ rs6000_builtin_vectorization_cost (enum vect_cost_ return 2; case vec_construct: - elements = TYPE_VECTOR_SUBPARTS (vectype); + /* This is a rough approximation assuming non-constant elements + constructed into a vector via element insertion. FIXME: + vec_construct is not granular enough for uniformly good + decisions. If the initialization is a splat, this is + cheaper than we estimate. Improve this someday. */ elem_type = TREE_TYPE (vectype); /* 32-bit vectors loaded into registers are stored as double - precision, so we need n/2 converts in addition to the usual - n/2 merges to construct a vector of short floats from them. */ + precision, so we need 2 permutes, 2 converts, and 1 merge + to construct a vector of short floats from them. */ if (SCALAR_FLOAT_TYPE_P (elem_type) && TYPE_PRECISION (elem_type) == 32) - return elements + 1; + return 5; else - return elements / 2 + 1; + return max (2, TYPE_VECTOR_SUBPARTS (vectype) - 1); default: gcc_unreachable (); On Mon, 2016-07-18 at 14:29 +0200, Richard Biener wrote: > On Mon, Jul 18, 2016 at 1:56 PM, Segher Boessenkool > <seg...@kernel.crashing.org> wrote: > > Hi Bill, > > > > On Fri, Jul 15, 2016 at 08:55:08AM -0500, Bill Schmidt wrote: > >> This patch is a follow-up to Richard's patch of > >> https://gcc.gnu.org/ml/gcc-patches/2016-06/msg00584.html. The cost of a > >> vec_construct (initialization of an N-way vector by N scalars) is too low, > >> which can cause too-aggressive vectorization in particular for N=8 or > >> higher. Richard changed the default cost to N-1, which is generally > >> sensible. For powerpc I am going with a slightly higher cost of N, which > >> will keep us from being less conservative than the previous values when > >> N=2. > > > >> In any case, the purpose of this patch is simply to avoid vectorizing > >> things we shouldn't when we've undercounted the cost of a vec_construct. > >> Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no > >> regressions (hence the vectorization decisions in the test suite have > >> not changed). Is this ok for trunk? > > > > Do you also have a testcase where it does matter? It would be good to > > add that, then. Or is it fixing a regression? > > > > I know nothing about the cost model, so someone else will have to review, > > or I can just say "okay" ;-) > > You can maybe look at gcc.dg/vect/slp-4[35].c (and run it with the cost model > enabled). > > Richard. > > > > > Segher >