On Mon, 1 Apr 2019, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > This is an update from last year's attempt to tame down vectorization
> > when it runs into ABI inefficiencies at function return.  I've
> > updated the patch to look for multi-reg returns instead of
> > !vector ones (due to word_mode vectorization) and handle a bit
> > more returns, esp. the common pattern of
> >
> >   ret = vector;
> >   D.1234 = ret;
> >   ret = {CLOBBER};
> >   return D.1234;
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?
> >
> > I know this isn't the ultimate solution but we keep getting
> > duplicates of the PR.
> >
> > Richard.
> >
> > 2019-03-28  Richard Biener  <rguent...@suse.de>
> >
> > 	PR tree-optimization/84101
> > 	* tree-vect-stmts.c: Include explow.h for hard_function_value,
> > 	regs.h for hard_regno_nregs.
> > 	(cfun_returns): New helper.
> > 	(vect_model_store_cost): When vectorizing a store to a decl
> > 	we return and the function ABI returns in a multi-reg location
> > 	account for the possible spilling that will happen.
> >
> > 	* gcc.target/i386/pr84101.c: New testcase.
> >
> > Index: gcc/tree-vect-stmts.c
> > ===================================================================
> > --- gcc/tree-vect-stmts.c	(revision 269987)
> > +++ gcc/tree-vect-stmts.c	(working copy)
> > @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.
> >  #include "tree-cfg.h"
> >  #include "tree-ssa-loop-manip.h"
> >  #include "cfgloop.h"
> > +#include "explow.h"
> >  #include "tree-ssa-loop.h"
> >  #include "tree-scalar-evolution.h"
> >  #include "tree-vectorizer.h"
> > @@ -52,6 +53,7 @@ along with GCC; see the file COPYING3.
> >  #include "vec-perm-indices.h"
> >  #include "tree-ssa-loop-niter.h"
> >  #include "gimple-fold.h"
> > +#include "regs.h"
> >
> >  /* For lang_hooks.types.type_for_mode.  */
> >  #include "langhooks.h"
> > @@ -948,6 +950,37 @@ vect_model_promotion_demotion_cost (stmt
> >  		     "prologue_cost = %d .\n", inside_cost, prologue_cost);
> >  }
> >
> > +/* Returns true if the current function returns DECL.  */
> > +
> > +static bool
> > +cfun_returns (tree decl)
> > +{
> > +  edge_iterator ei;
> > +  edge e;
> > +  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
> > +    {
> > +      greturn *ret = safe_dyn_cast <greturn *> (last_stmt (e->src));
> > +      if (!ret)
> > +	continue;
> > +      if (gimple_return_retval (ret) == decl)
> > +	return true;
> > +      /* We often end up with an aggregate copy to the result decl,
> > +	 handle that case as well.  First skip intermediate clobbers
> > +	 though.  */
> > +      gimple *def = ret;
> > +      do
> > +	{
> > +	  def = SSA_NAME_DEF_STMT (gimple_vuse (def));
> > +	}
> > +      while (gimple_clobber_p (def));
> > +      if (is_a <gassign *> (def)
> > +	  && gimple_assign_lhs (def) == gimple_return_retval (ret)
> > +	  && gimple_assign_rhs1 (def) == decl)
> > +	return true;
> > +    }
> > +  return false;
> > +}
> > +
> >  /* Function vect_model_store_cost
> >
> >     Models cost for stores.  In the case of grouped accesses, one access
> > @@ -1032,6 +1065,37 @@ vect_model_store_cost (stmt_vec_info stm
> >  			 vec_to_scalar, stmt_info, 0, vect_body);
> >      }
> >
> > +  /* When vectorizing a store into the function result assign
> > +     a penalty if the function returns in a multi-register location.
> > +     In this case we assume we'll end up with having to spill the
> > +     vector result and do piecewise loads.  */
> > +  tree base = get_base_address (STMT_VINFO_DATA_REF (stmt_info)->ref);
> > +  if (base
> > +      && (TREE_CODE (base) == RESULT_DECL
> > +	  || (DECL_P (base) && cfun_returns (base)))
> > +      && !aggregate_value_p (base, cfun->decl))
> > +    {
> > +      rtx reg = hard_function_value (TREE_TYPE (base), cfun->decl, 0, 1);
> > +      /* ???  Handle PARALLEL in some way.  */
> > +      if (REG_P (reg))
> > +	{
> > +	  int nregs = hard_regno_nregs (REGNO (reg), GET_MODE (reg));
> > +	  /* Assume that a reg-reg move is possible and cheap,
> > +	     do not account for vector to gp register move cost.  */
> > +	  if (nregs > 1)
>
> Looks OK to me FWIW, but maybe it would be worth having a check like:
>
>   targetm.secondary_memory_needed (TYPE_MODE (vectype), ALL_REGS,
> 				   REGNO_REG_CLASS (REGNO (reg)))
>
> as well as the above, so that we don't accidentally penalise values
> that are returned in multiple vector registers.  Looks like the i386.c
> definition will return true in the cases that matter.
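For concreteness, I read the suggestion as tightening the existing
nregs check to something like the following (just a sketch; whether
'vectype' is to hand at that point in vect_model_store_cost is an
assumption on my side):

	  int nregs = hard_regno_nregs (REGNO (reg), GET_MODE (reg));
	  if (nregs > 1
	      /* Only penalize if the pieces have to travel through
		 memory to reach the return registers.  */
	      && targetm.secondary_memory_needed (TYPE_MODE (vectype),
						  ALL_REGS,
						  REGNO_REG_CLASS (REGNO (reg))))
	    {
	      /* Spill/load penalty as in the patch.  */
	      ...
	    }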
I wonder if what this hook returns makes sense if you feed it
modes with different sizes?  Say, for the testcase, V2DImode
and the general regs class (DImode).  No "direct" moves exist
because the result doesn't fit anyway, so I wonder about the
semantics of secondary_memory_needed here.

Wouldn't it be more appropriate to specify the correct register
class instead of ALL_REGS here and choose GET_MODE (reg) for
the mode instead?  Because in the end we'll do N GET_MODE (reg)
moves from the vector register class to reg's register class
(I've spelled this out in a sketch further below).  OTOH I have
no idea how to get at the register class of 'vectype' ...?

But yes, something like that might be an improvement.  Still,
x86 _can_ do direct moves between xmm and general regs (some
tunings prefer to go through memory), but the issue is really
that the RTL expander will emit code that the register allocator
is later not able to mangle into a form w/o spills.  The issue
there is really the upper-half extracts of the vector register.

So I'm not so sure that targetm.secondary_memory_needed is the
correct tool to use for this heuristic - it is, after all, an
RTL "optimization" issue we are working around.  And we'd even
like to penalize the non-memory case, because going from

pair:
	lea    (%rdi,%rdi,1),%eax
	sar    %edi
	movslq %edi,%rdi
	cltq
	mov    %rax,-0x18(%rsp)
	movq   -0x18(%rsp),%xmm0
	mov    %rdi,-0x18(%rsp)
	movhps -0x18(%rsp),%xmm0
	movaps %xmm0,-0x18(%rsp)
	mov    -0x18(%rsp),%rax
	mov    -0x10(%rsp),%rdx
	retq

to

pair:
	lea    (%rdi,%rdi,1),%eax
	sar    %edi
	movslq %edi,%rdi
	cltq
	mov    %rax,-0x18(%rsp)
	movq   -0x18(%rsp),%xmm0
	mov    %rdi,-0x18(%rsp)
	movhps -0x18(%rsp),%xmm0
	movlps %xmm0, %rdx
	movhps %xmm0, %rax
	retq

is still worse than not vectorizing that copy (just in case
targetm.secondary_memory_needed says false, which it might
eventually do because we strictly do not need secondary memory
here).  Quoting the x86 implementation of the hook, it seems
we're "saved" by the mode size check only:

  if (SSE_CLASS_P (class1) != SSE_CLASS_P (class2))
    {
      /* SSE1 doesn't have any direct moves from other classes.  */
      if (!TARGET_SSE2)
	return true;

      /* If the target says that inter-unit moves are more expensive
	 than moving through memory, then don't generate them.  */
      if ((SSE_CLASS_P (class1) && !TARGET_INTER_UNIT_MOVES_FROM_VEC)
	  || (SSE_CLASS_P (class2) && !TARGET_INTER_UNIT_MOVES_TO_VEC))
	return true;

      /* Between SSE and general, we have moves no larger than word size.  */
      if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
	return true;

Your case, where a target returns in multiple vector registers,
is one not well handled by the patch in general - there isn't
any attempt at looking at the shape of the RESULT_DECL, we just
see whether a vector store goes somewhere into it.  And we do
this when processing each store.

As said, this is mostly a hack that covers the cases the issue
was reported for and shouldn't penalize most other cases.  Yes,
it might pessimize a target that returns large aggregates in
multiple vector registers, but is there any?
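To make the variant I mused about above concrete, the call I have
in mind would look roughly like this (again just a sketch, and
VEC_REG_CLASS is a placeholder for the register class of 'vectype',
which as said I don't know how to compute here):

	  if (nregs > 1
	      /* Would each of the N piecewise moves from the vector
		 result to a return register have to go through
		 memory?  */
	      && targetm.secondary_memory_needed (GET_MODE (reg),
						  VEC_REG_CLASS,
						  REGNO_REG_CLASS (REGNO (reg))))
	    /* ... apply the spill penalty ... */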
So, may I go with the original patch?

Thanks,
Richard.

> Thanks,
> Richard
>
> > +	    {
> > +	      /* Spill.  */
> > +	      prologue_cost += record_stmt_cost (cost_vec, ncopies,
> > +						 vector_store,
> > +						 stmt_info, 0, vect_epilogue);
> > +	      /* Loads.  */
> > +	      prologue_cost += record_stmt_cost (cost_vec, ncopies * nregs,
> > +						 scalar_load,
> > +						 stmt_info, 0, vect_epilogue);
> > +	    }
> > +	}
> > +    }
> > +
> >    if (dump_enabled_p ())
> >      dump_printf_loc (MSG_NOTE, vect_location,
> >                       "vect_model_store_cost: inside_cost = %d, "
> > Index: gcc/testsuite/gcc.target/i386/pr84101.c
> > ===================================================================
> > --- gcc/testsuite/gcc.target/i386/pr84101.c	(nonexistent)
> > +++ gcc/testsuite/gcc.target/i386/pr84101.c	(working copy)
> > @@ -0,0 +1,21 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -fdump-tree-slp2-details" } */
> > +
> > +typedef struct uint64_pair uint64_pair_t ;
> > +struct uint64_pair
> > +{
> > +  unsigned long w0 ;
> > +  unsigned long w1 ;
> > +} ;
> > +
> > +uint64_pair_t pair(int num)
> > +{
> > +  uint64_pair_t p ;
> > +
> > +  p.w0 = num << 1 ;
> > +  p.w1 = num >> 1 ;
> > +
> > +  return p ;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "basic block vectorized" "slp2" } } */
>

-- 
Richard Biener <rguent...@suse.de>
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)