On 31 January 2018 at 16:01, Richard Biener <rguent...@suse.de> wrote:
> On Wed, 31 Jan 2018, Christophe Lyon wrote:
>
> > On 30 January 2018 at 11:47, Jakub Jelinek <ja...@redhat.com> wrote:
> > > On Tue, Jan 30, 2018 at 11:07:50AM +0100, Richard Biener wrote:
> > >>
> > >> I have been asked to push this change, fixing (somewhat) the impreciseness
> > >> of costing constant/invariant vector uses in SLP stmts.  The previous
> > >> code always just considered a single constant to be generated in the
> > >> prologue irrespective of how many we'd need.  With this patch we
> > >> properly handle this count and optimize for the case when we can use
> > >> a vector splat.  It doesn't yet handle CSE (or CSE among stmts) which
> > >> means it could in theory regress cases it overall costed correctly
> > >> before "optimistically" (aka by accident).  But at least the costing
> > >> now matches code generation.
> > >>
> > >> Bootstrapped and tested on x86_64-unknown-linux-gnu.  On x86_64
> > >> Haswell with AVX2 SPEC 2k6 shows no off-noise changes.
> > >>
> > >> The patch is said to help the case in the PR when additional backend
> > >> costing changes are done (for AVX512).
> > >>
> > >> Ok for trunk at this stage?
> > >
> > > LGTM.
> > >
> > >> 2018-01-30  Richard Biener  <rguent...@suse.de>
> > >>
> > >>       PR tree-optimization/83008
> > >>       * tree-vect-slp.c (vect_analyze_slp_cost_1): Properly cost
> > >>       invariant and constant vector uses in stmts when they need
> > >>       more than one stmt.
> > >
> > >         Jakub
> >
> > Hi Richard,
> >
> > This patch caused a regression on aarch64*:
> > FAIL: gcc.dg/cse_recip.c scan-tree-dump-times optimized "rdiv_expr" 1
> > (found 2 times)
> > we used to have:
> > PASS: gcc.dg/cse_recip.c scan-tree-dump-times optimized "rdiv_expr" 1
>
> We now vectorize this on aarch64 - looks like there's a V2SFmode
> available.  This means we get 1/x computed and divide by {x, x}.
> The former is non-optimal because we leave dead code around after
> SLP vectorization which the multi-use check of the recip pass
> trips on to make this transform profitable.
>
> That's worth a bugreport I think.
>

OK, I filed https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84214

> For the testcase I'd simply adjust it to pass -fno-slp-vectorize
>

I'll do that.

Thanks,

Christophe

> -- or make sure to run the recip pass before vectorization.  Not
> sure why it runs before loop optimizations?
>
> Index: gcc/passes.def
> ===================================================================
> --- gcc/passes.def      (revision 257233)
> +++ gcc/passes.def      (working copy)
> @@ -263,6 +263,7 @@ along with GCC; see the file COPYING3.
>        NEXT_PASS (pass_asan);
>        NEXT_PASS (pass_tsan);
>        NEXT_PASS (pass_dce);
> +      NEXT_PASS (pass_cse_reciprocals);
>        /* Pass group that runs when 1) enabled, 2) there are loops
>          in the function.  Make sure to run pass_fix_loops before
>          to discover/remove loops before running the gate function
> @@ -317,7 +318,6 @@ along with GCC; see the file COPYING3.
>           POP_INSERT_PASSES ()
>        NEXT_PASS (pass_simduid_cleanup);
>        NEXT_PASS (pass_lower_vector_ssa);
> -      NEXT_PASS (pass_cse_reciprocals);
>        NEXT_PASS (pass_sprintf_length, true);
>        NEXT_PASS (pass_reassoc, false /* insert_powi_p */);
>        NEXT_PASS (pass_strength_reduction);
>
> puts it right before loop opts and after a DCE pass.  This results
> in us no longer vectorizing the code:
>
> Vector inside of basic block cost: 4
> Vector prologue cost: 4
> Vector epilogue cost: 0
> Scalar cost of basic block: 6
> /space/rguenther/src/svn/early-lto-debug/gcc/testsuite/gcc.dg/cse_recip.c:10:1:
> note: not vectorized: vectorization is not profitable.
>
> Not sure if we want to shuffle passes at this stage though.
>
> Richard.
>
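
For readers following the cost dump quoted above, here is a minimal sketch of the
comparison that produces the "not profitable" note.  It assumes the basic-block
vectorizer rejects a block whenever the summed vector costs (inside + prologue +
epilogue) are not strictly below the scalar cost.  The struct and the function
name are hypothetical and only for illustration; they are not GCC's own code.

```c
#include <stdbool.h>

/* The four numbers printed in the vectorizer dump.  */
struct bb_vect_costs
{
  int vec_inside;    /* "Vector inside of basic block cost" */
  int vec_prologue;  /* "Vector prologue cost" */
  int vec_epilogue;  /* "Vector epilogue cost" */
  int scalar;        /* "Scalar cost of basic block" */
};

/* Hypothetical helper (assumption, not the GCC function): vectorization
   is treated as profitable only if the total vector cost is strictly
   smaller than the scalar cost of the same basic block.  */
bool
bb_vect_profitable_p (const struct bb_vect_costs *c)
{
  int vec_total = c->vec_inside + c->vec_prologue + c->vec_epilogue;
  return vec_total < c->scalar;
}
```

With the numbers from the dump (4, 4, 0 against a scalar cost of 6) the total
vector cost is 8, so the helper returns false, consistent with the
"not vectorized" note.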