https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
--- Comment #22 from rguenther at suse dot de <rguenther at suse dot de> --- On Wed, 10 Jan 2018, sergey.shalnov at intel dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 > > --- Comment #21 from sergey.shalnov at intel dot com --- > Thanks Richard for your comments. > Based on our discussion I've produced the patch attached and > run it on SPEC2017intrate/fprate on skylake server (with [-Ofast -flto > -march=skylake-avx512 -mfpmath=sse -funroll-loops]). > Please note, I used your last proposed patch and changed loop trip count > calculation ("ncopies_for_cost * nunits / group_size" is always 1). > > I see the following performance changes: > SPEC CPU 2017 intrate > 500.perlbench_r -0.7% > 525.x264_r +7.2% > Geomean: +0.8% > > SPEC CPU 2017 fprate > 527.cam4_r -1.1% > 538.imagick_r +4.7% > 544.nab_r +3.6% > Geomean: +0.6% > > I believe that after appropriate cost model tweaks for other targets a gain > could be observed but I haven't checked it carefully. > It provides a good performance gain for the original case and a few other > improvements. > > Can you please take a look at the patch and comment (or might propose another > method)? It mixes several things, the last one (> to >= change in cost evaluation clearly wrong). The skylake_cost changes look somewhat odd to me. I'll attach my current SLP costing adjustment patch (after the SVE changes the old didn't build anymore). > Sergey > > From 41e5094cbdce72d4cc5e04fc3d11c01c3c1adbb7 Mon Sep 17 00:00:00 2001 > From: Sergey Shalnov <sergey.shal...@intel.com> > Date: Tue, 9 Jan 2018 14:37:14 +0100 > Subject: [PATCH, SLP] SLP_common_algo_changes > > --- > gcc/config/i386/x86-tune-costs.h | 4 ++-- > gcc/tree-vect-slp.c | 41 > ++++++++++++++++++++++++++++++---------- > 2 files changed, 33 insertions(+), 12 deletions(-) > > diff --git a/gcc/config/i386/x86-tune-costs.h > b/gcc/config/i386/x86-tune-costs.h > index 312467d..3e0f904 100644 > --- a/gcc/config/i386/x86-tune-costs.h > +++ b/gcc/config/i386/x86-tune-costs.h > @@ -1555,7 +1555,7 @@ struct processor_costs skylake_cost = { > {4, 4, 4}, /* cost of loading integer registers > in QImode, HImode and SImode. > Relative to reg-reg move (2). */ > - {6, 6, 6}, /* cost of storing integer registers > */ > + {6, 6, 4}, /* cost of storing integer registers. > */ > 2, /* cost of reg,reg fld/fst */ > {6, 6, 8}, /* cost of loading fp registers > in SFmode, DFmode and XFmode */ > @@ -1570,7 +1570,7 @@ struct processor_costs skylake_cost = { > {6, 6, 6, 10, 20}, /* cost of loading SSE registers > in 32,64,128,256 and 512-bit */ > {6, 6, 6, 10, 20}, /* cost of unaligned loads. */ > - {8, 8, 8, 8, 16}, /* cost of storing SSE registers > + {8, 8, 8, 16, 32}, /* cost of storing SSE registers > in 32,64,128,256 and 512-bit */ > {8, 8, 8, 8, 16}, /* cost of unaligned stores. */ > 2, 2, /* SSE->integer and > integer->SSE moves */ > diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c > index 0ca42b4..7e63a1c 100644 > --- a/gcc/tree-vect-slp.c > +++ b/gcc/tree-vect-slp.c > @@ -1815,18 +1815,39 @@ vect_analyze_slp_cost_1 (slp_instance instance, > slp_tree node, > enum vect_def_type dt; > if (!op || op == lhs) > continue; > - if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt)) > + if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt) > + && (dt == vect_constant_def || dt == vect_external_def)) > { > /* Without looking at the actual initializer a vector of > constants can be implemented as load from the constant pool. > - ??? We need to pass down stmt_info for a vector type > - even if it points to the wrong stmt. */ > - if (dt == vect_constant_def) > - record_stmt_cost (prologue_cost_vec, 1, vector_load, > - stmt_info, 0, vect_prologue); > - else if (dt == vect_external_def) > - record_stmt_cost (prologue_cost_vec, 1, vec_construct, > - stmt_info, 0, vect_prologue); > + When all elements are the same we can use a splat. */ > + unsigned group_size = SLP_TREE_SCALAR_STMTS (node).length (); > + tree elt = NULL_TREE; > + unsigned nelt = 0; > + for (unsigned j = 0; j < ncopies_for_cost; ++j) > + for (unsigned k = 0; k < group_size; ++k) > + { > + if (nelt == 0) > + elt = gimple_op (SLP_TREE_SCALAR_STMTS (node)[nelt], i); > + /* ??? We're just tracking whether all operands of a single > + vector initializer are the same, ideally we'd check if > + we emitted the same one already. */ > + else if (elt != gimple_op (SLP_TREE_SCALAR_STMTS (node)[nelt], > + i)) > + elt = NULL_TREE; > + nelt++; > + if (nelt == group_size) > + { > + /* ??? We need to pass down stmt_info for a vector type > + even if it points to the wrong stmt. */ > + record_stmt_cost (prologue_cost_vec, 1, > + dt == vect_external_def > + ? (elt ? scalar_to_vec : vec_construct) > + : vector_load, > + stmt_info, 0, vect_prologue); > + nelt = 0; > + } > + } > } > } > > @@ -2820,7 +2841,7 @@ vect_bb_vectorization_profitable_p (bb_vec_info > bb_vinfo) > the cost estimate is otherwise quite pessimistic (constant uses are > free on the scalar side but cost a load on the vector side for > example). */ > - if (vec_outside_cost + vec_inside_cost > scalar_cost) > + if (vec_outside_cost + vec_inside_cost >= scalar_cost) > return false; > > return true; > -- > 1.8.3.1 > > --- Comment #23 from Richard Biener <rguenth at gcc dot gnu.org> --- Created attachment 43084 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43084&action=edit SLP costing for constants/externs improvement