Re: [PATCH] Fix PR83008

Christophe Lyon Mon, 05 Feb 2018 06:20:47 -0800

On 31 January 2018 at 16:01, Richard Biener <rguent...@suse.de> wrote:


> On Wed, 31 Jan 2018, Christophe Lyon wrote:
>
> > On 30 January 2018 at 11:47, Jakub Jelinek <ja...@redhat.com> wrote:
> > > On Tue, Jan 30, 2018 at 11:07:50AM +0100, Richard Biener wrote:
> > >>
> > >> I have been asked to push this change, fixing (somewhat) the
> > >> imprecision
> > >> of costing constant/invariant vector uses in SLP stmts.  The previous
> > >> code always just considered a single constant to be generated in the
> > >> prologue irrespective of how many we'd need.  With this patch we
> > >> properly handle this count and optimize for the case when we can use
> > >> a vector splat.  It doesn't yet handle CSE (or CSE among stmts) which
> > >> means it could in theory regress cases it overall costed correctly
> > >> before "optimistically" (aka by accident).  But at least the costing
> > >> now matches code generation.
> > >>
> > >> Bootstrapped and tested on x86_64-unknown-linux-gnu.  On x86_64
> > >> Haswell with AVX2, SPEC 2k6 shows no changes beyond noise.
> > >>
> > >> The patch is said to help the case in the PR when additional backend
> > >> costing changes are done (for AVX512).
> > >>
> > >> Ok for trunk at this stage?
> > >
> > > LGTM.
> > >
> > >> 2018-01-30  Richard Biener  <rguent...@suse.de>
> > >>
> > >>       PR tree-optimization/83008
> > >>       * tree-vect-slp.c (vect_analyze_slp_cost_1): Properly cost
> > >>       invariant and constant vector uses in stmts when they need
> > >>       more than one stmt.
> > >
> > >         Jakub
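The costing change described in the patch can be illustrated with a small sketch. It is hypothetical and is not the code in tree-vect-slp.c: the names cost_invariant_operands, COST_SPLAT and COST_CONSTRUCT, and the cost values, are invented for illustration. It counts one prologue stmt per vector that has to be built (the old code counted a single one no matter how many were needed), and costs a vector whose lanes are all equal as a cheaper splat.

```c
#include <assert.h>

/* Hypothetical per-stmt costs.  GCC's real costs come from the target's
   vectorizer cost hook, not from these constants.  */
enum { COST_SPLAT = 1, COST_CONSTRUCT = 2 };

/* Cost building NVECS vectors of NUNITS lanes each for the constant or
   invariant operands OPS (NVECS * NUNITS values, stored vector by vector).
   A vector whose lanes are all equal is costed as a splat, any other
   vector as a full construct.  Identical vectors are NOT shared (no CSE),
   mirroring the limitation stated in the patch description.  */
static int
cost_invariant_operands (const int *ops, int nvecs, int nunits)
{
  assert (nvecs > 0 && nunits > 0);
  int cost = 0;
  for (int v = 0; v < nvecs; v++)
    {
      const int *lanes = ops + v * nunits;
      int is_splat = 1;
      for (int i = 1; i < nunits; i++)
        if (lanes[i] != lanes[0])
          {
            is_splat = 0;
            break;
          }
      cost += is_splat ? COST_SPLAT : COST_CONSTRUCT;
    }
  return cost;
}
```

For example, operands {3, 3, 3, 3, 1, 2, 3, 4} with four lanes per vector cost one splat plus one construct, where the pre-patch model would have counted a single prologue stmt. Two identical splat vectors are still costed twice, which is the missing CSE the patch mentions.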
> >
> > Hi Richard,
> >
> > This patch caused a regression on aarch64*:
> > FAIL: gcc.dg/cse_recip.c scan-tree-dump-times optimized "rdiv_expr" 1 (found 2 times)
> > we used to have:
> > PASS: gcc.dg/cse_recip.c scan-tree-dump-times optimized "rdiv_expr" 1
>
> We now vectorize this on aarch64; it looks like V2SFmode is
> available.  This means we compute 1/x and also divide by {x, x}.
> The former is suboptimal: we leave dead code around after SLP
> vectorization, and the recip pass's multi-use check, which decides
> whether this transform is profitable, trips on it.
>
> That's worth a bug report I think.
>
OK, I filed https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84214
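For context, the transform the recip pass performs (when reciprocal math is allowed, e.g. via -freciprocal-math or -ffast-math) replaces several divisions by the same divisor with one reciprocal followed by multiplications. The sketch below shows the two shapes in plain C; the function names are hypothetical and this is not the contents of gcc.dg/cse_recip.c. The rewrite only pays off when enough divisions share the divisor, which is why leftover dead statements after SLP vectorization can change the outcome of the pass's multi-use check.

```c
#include <assert.h>

/* Before: three divisions by the same divisor D.  */
static void
three_quotients (double a, double b, double c, double d, double out[3])
{
  assert (d != 0.0);  /* The sketch does not model division by zero.  */
  out[0] = a / d;
  out[1] = b / d;
  out[2] = c / d;
}

/* After reciprocal CSE: one division, then multiplications.  Only valid
   with relaxed FP semantics, since x * (1/d) may differ from x / d in
   the last bit.  */
static void
three_quotients_recip (double a, double b, double c, double d, double out[3])
{
  assert (d != 0.0);
  const double r = 1.0 / d;  /* The single remaining division.  */
  out[0] = a * r;
  out[1] = b * r;
  out[2] = c * r;
}
```

With a divisor such as 4.0, whose reciprocal is exact, both versions give identical results; with a divisor such as 3.0 they may differ in the last bit.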


> For the testcase I'd simply adjust it to pass -fno-slp-vectorize
>
I'll do that.

Thanks,

Christophe


> -- or make sure to run the recip pass before vectorization.  Not
> sure why it runs after loop optimizations?
>
> Index: gcc/passes.def
> ===================================================================
> --- gcc/passes.def      (revision 257233)
> +++ gcc/passes.def      (working copy)
> @@ -263,6 +263,7 @@ along with GCC; see the file COPYING3.
>        NEXT_PASS (pass_asan);
>        NEXT_PASS (pass_tsan);
>        NEXT_PASS (pass_dce);
> +      NEXT_PASS (pass_cse_reciprocals);
>        /* Pass group that runs when 1) enabled, 2) there are loops
>          in the function.  Make sure to run pass_fix_loops before
>          to discover/remove loops before running the gate function
> @@ -317,7 +318,6 @@ along with GCC; see the file COPYING3.
>        POP_INSERT_PASSES ()
>        NEXT_PASS (pass_simduid_cleanup);
>        NEXT_PASS (pass_lower_vector_ssa);
> -      NEXT_PASS (pass_cse_reciprocals);
>        NEXT_PASS (pass_sprintf_length, true);
>        NEXT_PASS (pass_reassoc, false /* insert_powi_p */);
>        NEXT_PASS (pass_strength_reduction);
>
> puts it right before loop opts and after a DCE pass.  This results
> in us no longer vectorizing the code:
>
>   Vector inside of basic block cost: 4
>   Vector prologue cost: 4
>   Vector epilogue cost: 0
>   Scalar cost of basic block: 6
> /space/rguenther/src/svn/early-lto-debug/gcc/testsuite/gcc.dg/cse_recip.c:10:1: note: not vectorized: vectorization is not profitable.
>
> Not sure if we want to shuffle passes at this stage though.
>
> Richard.
>
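The vectorizer dump quoted above reduces to simple arithmetic: 4 (inside) + 4 (prologue) + 0 (epilogue) = 8 vector cost units against 6 scalar units, so basic-block vectorization is rejected. A hedged C sketch of that comparison follows; the function name is hypothetical, GCC's actual check lives in tree-vect-slp.c, and whether equal costs count as profitable is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the basic-block SLP profitability decision: the
   vector cost is the sum of the inside, prologue and epilogue costs, and
   vectorization is accepted only if it does not exceed the scalar cost.
   Treating a tie as profitable is an assumption, not GCC's documented
   behaviour.  */
static bool
bb_vectorization_profitable (int inside_cost, int prologue_cost,
                             int epilogue_cost, int scalar_cost)
{
  assert (inside_cost >= 0 && prologue_cost >= 0
          && epilogue_cost >= 0 && scalar_cost >= 0);
  int vector_cost = inside_cost + prologue_cost + epilogue_cost;
  return vector_cost <= scalar_cost;
}
```

Plugging in the numbers from the dump, bb_vectorization_profitable (4, 4, 0, 6) returns false, matching the "vectorization is not profitable" note.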
