On Mon, 21 Feb 2022, Hongtao Liu wrote: > On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches > <gcc-patches@gcc.gnu.org> wrote: > > > > This uses the now passed SLP node to the vectorizer costing hook > > to adjust vector construction costs for the cost of moving an > > integer component from a GPR to a vector register when that's > > required for building a vector from components. A cruical difference > > here is whether the component is loaded from memory or extracted > > from a vector register as in those cases no intermediate GPR is involved. > > > > The pr99881.c testcase can be Un-XFAILed with this patch, the > > pr91446.c testcase now produces scalar code which looks superior > > to me so I've adjusted it as well. > > > > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu > > after adding the BIT_FIELD_REF vector extracting special casing. > Does the patch handle PR101929?
The patch will regress the testcase posted in PR101929 again: _255 1 times scalar_store costs 12 in body _261 1 times scalar_store costs 12 in body _258 1 times scalar_store costs 12 in body _264 1 times scalar_store costs 12 in body t0_247 + t2_251 1 times scalar_stmt costs 4 in body t1_472 + t3_444 1 times scalar_stmt costs 4 in body t0_406 - t2_451 1 times scalar_stmt costs 4 in body t1_472 - t3_444 1 times scalar_stmt costs 4 in body -node 0x4182f48 1 times vec_construct costs 16 in prologue -node 0x41882b0 1 times vec_construct costs 16 in prologue +node 0x4182f48 1 times vec_construct costs 28 in prologue +node 0x41882b0 1 times vec_construct costs 28 in prologue t0_406 + t2_451 1 times vector_stmt costs 4 in body t1_472 - t3_444 1 times vector_stmt costs 4 in body node 0x41829f8 1 times vec_perm costs 4 in body _436 1 times vector_store costs 16 in body t.c:37:9: note: Cost model analysis for part in loop 0: - Vector cost: 60 + Vector cost: 84 Scalar cost: 64 +t.c:37:9: missed: not vectorized: vectorization is not profitable. We're constructing V4SI from patterns like { _407, _480, _407, _480 } where the components are results of integer adds (so the result is definitely in a GPR). We are costing the construction as 4 * sse_op + 2 * sse_to_integer which with skylake cost is 4 * COSTS_N_INSNS (1) + 2 * 6. Whether the vectorization itself is profitable is likely questionable but then it's true that the construction of V4SI is more costly in terms of uops than a construction of V4SF. Now, we can - for the first time - now see the actual construction pattern and ideal construction might be two GPR->xmm moves + two splats + one unpack or maybe two GPR->xmm moves + one unpack + splat of DI (or other means of duplicating the lowpart). It still wouldn't help here of course, and we would not have RTL expansion adhere to this scheme. Trying to capture secondary effects (the FRE opportunity unleashed) in costing of this particular SLP subgraph is difficult and probably undesirable though. Adjusting any of the target specific costs is likely not going to work because of the original narrow profitability gap (60 vs. 64). For the particular kernel the overall vectorization strathegy needs to improve (PR98138 is about this I think). I know we've reverted the previous change that attempted to cost for GPR -> XMM. It did case vec_construct: { /* N element inserts into SSE vectors. */ + int cost + = TYPE_VECTOR_SUBPARTS (vectype) * (fp ? + ix86_cost->sse_op + : ix86_cost->integer_to_sse); + - int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op; thus treated integer_to_sse as GPR insert into vector cost for integer components in general while the now proposed change _adds_ integer to XMM move costs (but only once for duplicate elements and not for the cases we know are from memory / vector regs). With the proposed patch we can probably be more "correct" above and only cost TYPE_VECTOR_SUBPARTS (vectype) - 1 inserts given for FP the first one is free and for int we're costing it separately. Again it won't help here. Thanks, Richard. > > > > I suppose we can let autotesters look for SPEC performance fallout. > > > > OK if testing succeeds? > > > > Thanks, > > Richard. > > > > 2022-02-18 Richard Biener <rguent...@suse.de> > > > > PR tree-optimization/104582 > > PR target/99881 > > * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): > > Cost GPR to vector register moves for integer vector construction. > > > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New. > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise. > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise. > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise. > > * gcc.target/i386/pr99881.c: Un-XFAIL. > > * gcc.target/i386/pr91446.c: Adjust to not expect vectorization. > > --- > > gcc/config/i386/i386.cc | 45 ++++++++++++++++++- > > .../costmodel/x86_64/costmodel-pr104582-1.c | 15 +++++++ > > .../costmodel/x86_64/costmodel-pr104582-2.c | 13 ++++++ > > .../costmodel/x86_64/costmodel-pr104582-3.c | 13 ++++++ > > .../costmodel/x86_64/costmodel-pr104582-4.c | 15 +++++++ > > gcc/testsuite/gcc.target/i386/pr91446.c | 2 +- > > gcc/testsuite/gcc.target/i386/pr99881.c | 2 +- > > 7 files changed, 102 insertions(+), 3 deletions(-) > > create mode 100644 > > gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > create mode 100644 > > gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > create mode 100644 > > gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > create mode 100644 > > gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > > index 0830dbd7dca..b2bf90576d5 100644 > > --- a/gcc/config/i386/i386.cc > > +++ b/gcc/config/i386/i386.cc > > @@ -22997,7 +22997,7 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool > > costing_for_scalar) > > > > unsigned > > ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > > - stmt_vec_info stmt_info, slp_tree, > > + stmt_vec_info stmt_info, slp_tree node, > > tree vectype, int misalign, > > vect_cost_model_location where) > > { > > @@ -23160,6 +23160,49 @@ ix86_vector_costs::add_stmt_cost (int count, > > vect_cost_for_stmt kind, > > stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, > > misalign); > > stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1); > > } > > + else if (kind == vec_construct > > + && node > > + && SLP_TREE_DEF_TYPE (node) == vect_external_def > > + && INTEGRAL_TYPE_P (TREE_TYPE (vectype))) > > + { > > + stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, > > misalign); > > + unsigned i; > > + tree op; > > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > > + if (TREE_CODE (op) == SSA_NAME) > > + TREE_VISITED (op) = 0; > > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > > + { > > + if (TREE_CODE (op) != SSA_NAME > > + || TREE_VISITED (op)) > > + continue; > > + TREE_VISITED (op) = 1; > > + gimple *def = SSA_NAME_DEF_STMT (op); > > + tree tem; > > + if (is_gimple_assign (def) > > + && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def)) > > + && ((tem = gimple_assign_rhs1 (def)), true) > > + && TREE_CODE (tem) == SSA_NAME > > + /* A sign-change expands to nothing. */ > > + && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)), > > + TREE_TYPE (tem))) > > + def = SSA_NAME_DEF_STMT (tem); > > + /* When the component is loaded from memory we can directly > > + move it to a vector register, otherwise we have to go > > + via a GPR or via vpinsr which involves similar cost. > > + Likewise with a BIT_FIELD_REF extracting from a vector > > + register we can hope to avoid using a GPR. */ > > + if (!is_gimple_assign (def) > > + || (!gimple_assign_load_p (def) > > + && (gimple_assign_rhs_code (def) != BIT_FIELD_REF > > + || !VECTOR_TYPE_P (TREE_TYPE > > + (TREE_OPERAND (gimple_assign_rhs1 (def), > > 0)))))) > > + stmt_cost += ix86_cost->sse_to_integer; > > + } > > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > > + if (TREE_CODE (op) == SSA_NAME) > > + TREE_VISITED (op) = 0; > > + } > > if (stmt_cost == -1) > > stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > > > diff --git > > a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > new file mode 100644 > > index 00000000000..992a845ad7a > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > @@ -0,0 +1,15 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > + > > +struct S { unsigned long a, b; } s; > > + > > +void > > +foo (unsigned long *a, unsigned long *b) > > +{ > > + unsigned long a_ = *a; > > + unsigned long b_ = *b; > > + s.a = a_; > > + s.b = b_; > > +} > > + > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > diff --git > > a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > new file mode 100644 > > index 00000000000..7637cdb4a97 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > @@ -0,0 +1,13 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > + > > +struct S { unsigned long a, b; } s; > > + > > +void > > +foo (unsigned long a, unsigned long b) > > +{ > > + s.a = a; > > + s.b = b; > > +} > > + > > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } > > } */ > > diff --git > > a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > new file mode 100644 > > index 00000000000..999c4905708 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > @@ -0,0 +1,13 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > + > > +struct S { double a, b; } s; > > + > > +void > > +foo (double a, double b) > > +{ > > + s.a = a; > > + s.b = b; > > +} > > + > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > diff --git > > a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > new file mode 100644 > > index 00000000000..cc471e1ed73 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > @@ -0,0 +1,15 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > + > > +struct S { unsigned long a, b; } s; > > + > > +void > > +foo (signed long *a, unsigned long *b) > > +{ > > + unsigned long a_ = *a; > > + unsigned long b_ = *b; > > + s.a = a_; > > + s.b = b_; > > +} > > + > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c > > b/gcc/testsuite/gcc.target/i386/pr91446.c > > index 0243ca3ea68..067bf43f698 100644 > > --- a/gcc/testsuite/gcc.target/i386/pr91446.c > > +++ b/gcc/testsuite/gcc.target/i386/pr91446.c > > @@ -21,4 +21,4 @@ foo (unsigned long long width, unsigned long long height, > > bar (&t); > > } > > > > -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */ > > +/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */ > > diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c > > b/gcc/testsuite/gcc.target/i386/pr99881.c > > index 3e087eb2ed7..a1ec1d1ba8a 100644 > > --- a/gcc/testsuite/gcc.target/i386/pr99881.c > > +++ b/gcc/testsuite/gcc.target/i386/pr99881.c > > @@ -1,7 +1,7 @@ > > /* PR target/99881. */ > > /* { dg-do compile { target { ! ia32 } } } */ > > /* { dg-options "-Ofast -march=skylake" } */ > > -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */ > > +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */ > > > > void > > foo (int* __restrict a, int n, int c) > > -- > > 2.34.1 > > > > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)