On Wed, 3 Apr 2019, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Mon, 1 Apr 2019, Richard Sandiford wrote:
> >
> >> Richard Biener <rguent...@suse.de> writes:
> >> > This is an update from last year's attempt to tame down vectorization
> >> > when it runs into ABI inefficiencies at function return.  I've
> >> > updated the patch to look for multi-reg returns instead of
> >> > !vector ones (due to word_mode vectorization) and handle a bit
> >> > more returns, esp. the common pattern of
> >> >
> >> >   ret = vector;
> >> >   D.1234 = ret;
> >> >   ret = {CLOBBER};
> >> >   return D.1234;
> >> >
> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?
> >> >
> >> > I know this isn't the ultimate solution but we keep getting
> >> > duplicates of the PR.
> >> >
> >> > Richard.
> >> >
> >> > 2019-03-28  Richard Biener  <rguent...@suse.de>
> >> >
> >> > 	PR tree-optimization/84101
> >> > 	* tree-vect-stmts.c: Include explow.h for hard_function_value,
> >> > 	regs.h for hard_regno_nregs.
> >> > 	(cfun_returns): New helper.
> >> > 	(vect_model_store_cost): When vectorizing a store to a decl
> >> > 	we return and the function ABI returns in a multi-reg location
> >> > 	account for the possible spilling that will happen.
> >> >
> >> > 	* gcc.target/i386/pr84101.c: New testcase.
> >> >
> >> > Index: gcc/tree-vect-stmts.c
> >> > ===================================================================
> >> > --- gcc/tree-vect-stmts.c	(revision 269987)
> >> > +++ gcc/tree-vect-stmts.c	(working copy)
> >> > @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.
> >> >  #include "tree-cfg.h"
> >> >  #include "tree-ssa-loop-manip.h"
> >> >  #include "cfgloop.h"
> >> > +#include "explow.h"
> >> >  #include "tree-ssa-loop.h"
> >> >  #include "tree-scalar-evolution.h"
> >> >  #include "tree-vectorizer.h"
> >> > @@ -52,6 +53,7 @@ along with GCC; see the file COPYING3.
> >> >  #include "vec-perm-indices.h"
> >> >  #include "tree-ssa-loop-niter.h"
> >> >  #include "gimple-fold.h"
> >> > +#include "regs.h"
> >> >
> >> >  /* For lang_hooks.types.type_for_mode.  */
> >> >  #include "langhooks.h"
> >> > @@ -948,6 +950,37 @@ vect_model_promotion_demotion_cost (stmt
> >> >                       "prologue_cost = %d .\n", inside_cost,
> >> >                       prologue_cost);
> >> >  }
> >> >
> >> > +/* Returns true if the current function returns DECL.  */
> >> > +
> >> > +static bool
> >> > +cfun_returns (tree decl)
> >> > +{
> >> > +  edge_iterator ei;
> >> > +  edge e;
> >> > +  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
> >> > +    {
> >> > +      greturn *ret = safe_dyn_cast <greturn *> (last_stmt (e->src));
> >> > +      if (!ret)
> >> > +	continue;
> >> > +      if (gimple_return_retval (ret) == decl)
> >> > +	return true;
> >> > +      /* We often end up with an aggregate copy to the result decl,
> >> > +	 handle that case as well.  First skip intermediate clobbers
> >> > +	 though.  */
> >> > +      gimple *def = ret;
> >> > +      do
> >> > +	{
> >> > +	  def = SSA_NAME_DEF_STMT (gimple_vuse (def));
> >> > +	}
> >> > +      while (gimple_clobber_p (def));
> >> > +      if (is_a <gassign *> (def)
> >> > +	  && gimple_assign_lhs (def) == gimple_return_retval (ret)
> >> > +	  && gimple_assign_rhs1 (def) == decl)
> >> > +	return true;
> >> > +    }
> >> > +  return false;
> >> > +}
> >> > +
> >> >  /* Function vect_model_store_cost
> >> >
> >> >     Models cost for stores.
> >> >     In the case of grouped accesses, one access
> >> > @@ -1032,6 +1065,37 @@ vect_model_store_cost (stmt_vec_info stm
> >> >  				       vec_to_scalar, stmt_info, 0,
> >> >  				       vect_body);
> >> >      }
> >> >
> >> > +  /* When vectorizing a store into the function result assign
> >> > +     a penalty if the function returns in a multi-register location.
> >> > +     In this case we assume we'll end up with having to spill the
> >> > +     vector result and do piecewise loads.  */
> >> > +  tree base = get_base_address (STMT_VINFO_DATA_REF (stmt_info)->ref);
> >> > +  if (base
> >> > +      && (TREE_CODE (base) == RESULT_DECL
> >> > +	  || (DECL_P (base) && cfun_returns (base)))
> >> > +      && !aggregate_value_p (base, cfun->decl))
> >> > +    {
> >> > +      rtx reg = hard_function_value (TREE_TYPE (base), cfun->decl, 0, 1);
> >> > +      /* ??? Handle PARALLEL in some way.  */
> >> > +      if (REG_P (reg))
> >> > +	{
> >> > +	  int nregs = hard_regno_nregs (REGNO (reg), GET_MODE (reg));
> >> > +	  /* Assume that a reg-reg move is possible and cheap,
> >> > +	     do not account for vector to gp register move cost.  */
> >> > +	  if (nregs > 1)
> >>
> >> Looks OK to me FWIW, but maybe it would be worth having a check like:
> >>
> >>   targetm.secondary_memory_needed (TYPE_MODE (vectype), ALL_REGS,
> >> 				    REGNO_REG_CLASS (REGNO (reg)))
> >>
> >> as well as the above, so that we don't accidentally penalise values
> >> that are returned in multiple vector registers.  Looks like the i386.c
> >> definition will return true in the cases that matter.
> >
> > I wonder if what this hook returns makes sense if you feed it modes
> > with different sizes?  Say for the testcase V2DImode and
> > the general regs class (DImode).
>
> In this case I think it's between V2DImode and TImode.  I went with
> the vector mode because ABIs are the one place where BLKmode registers
> are a thing.  Guess that's not a problem here though, because nregs > 1
> would be false for BLKmode.  So yeah, GET_MODE (reg) should be fine too.
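(For concreteness, here is roughly what the values in the patch work out
to for the x86_64 testcase under discussion -- a worked example inferred
from the patch and the psABI, not part of the patch itself:

  /* For a 16-byte struct of two unsigned longs returned by value
     on x86_64, the return location is the RAX:RDX pair.  */
  rtx reg = hard_function_value (TREE_TYPE (base), cfun->decl, 0, 1);
  /* reg is (reg:TI ax) here -- a TImode hard register pair.  */
  int nregs = hard_regno_nregs (REGNO (reg), GET_MODE (reg));
  /* nregs == 2, so the nregs > 1 penalty path is taken.  A type
     returned in a single XMM register would give nregs == 1 and
     no penalty.  */

which is why the penalty fires exactly for the multi-GPR returns the PR
complains about.)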
Ah, OK.  Though then the target would see TImode, ALL_REGS,
general-regs which isn't what we want to ask...  Leaves us with
V2DImode, ALL_REGS, general-regs then.

> > No "direct" moves exist because the
> > result doesn't fit anyway so I wonder about the semantics of
> > secondary_memory_needed here.  Wouldn't it be more appropriate to
> > specify the correct register class instead of ALL_REGS here and
> > choose GET_MODE (reg) for the mode instead?  Because in the end
> > we'll do N GET_MODE (reg) moves from the vector register class to
> > the reg regclass?  OTOH I have no idea how to get at the register
> > class of 'vectype' ...?
>
> I think we only end up asking this question if the target lets both
> register classes hold as many bytes as are in the mode.  It might need
> multiple registers, but that's OK.
>
> And yeah, it would be nice to be more precise than ALL_REGS, but I don't
> think there's a reliable way of getting the "preferred" register class
> for a mode in this context.
>
> The hook is supposed to be conservatively correct for ALL_REGS,
> so the above should err on the side of including the cost.

OK, I see.

> > But yes, something like that might be an improvement.  Still
> > x86 _can_ do direct moves between xmm and general regs
> > (some tunings prefer to go through memory) but the issue is
> > really that the RTL expander will emit code that later the
> > register allocator is not able to mangle into a form w/o
> > spills.  The issue there is really the upper half extracts
> > of the vector register.  So I'm not so sure that
> > targetm.secondary_memory_needed is the correct tool to use
> > for this heuristic - it is after all an RTL "optimization"
> > issue we are working around.
>
> But the moves do seem to be "proper" 128-bit moves:
>
> (insn 11 10 12 2 (set (reg:TI 87 [ D.1914 ])
>         (subreg:TI (reg:V2DI 90) 0)) "/tmp/f2.c":18:10 -1
>      (nil))
> (insn 12 11 16 2 (set (reg:TI 88 [ <retval> ])
>         (reg:TI 87 [ D.1914 ])) "/tmp/f2.c":18:10 -1
>      (nil))
> (insn 16 12 17 2 (set (reg/i:TI 0 ax)
>         (reg:TI 88 [ <retval> ])) "/tmp/f2.c":19:1 -1
>      (nil))
>
> which eventually becomes:
>
> (insn 16 10 17 2 (set (reg/i:TI 0 ax)
>         (subreg:TI (reg:V2DI 90) 0)) "/tmp/f2.c":19:1 65 {*movti_internal}
>      (expr_list:REG_DEAD (reg:V2DI 90)
>         (nil)))
>
> So I don't think it's an RTL optimisation problem as such.  Some targets
> (like AArch64) can do the equivalent of this without spills in either
> direction.  Similarly, MIPS32 can use mthc1/mtc1 and mfhc1/mfc1 for
> moving 64-bit FPRs to and from 32-bit GPRs.
>
> So I think it really is up to the target to decide whether it can/wants
> to do this without spills or not.  It seems wrong to cost in a spill
> that should never happen. :-)

True.  But as you say below it's only half of the story since the
vec_concat will still happen (and that is also costly, even more so
if going through memory due to STLF issues).

> If there is a case that needs spills, secondary_memory_needed has to
> return true for that case.  So I think it's the right hook to test.
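(Putting the two halves of this exchange together, the suggested
refinement would slot into the patch roughly like this -- a sketch only,
combining the patch's condition with the hook call proposed above:

  /* Sketch: only account the spill penalty when the return value
     needs multiple registers AND the target says a move from the
     vector's class into the return register's class has to go
     through memory.  A target returning in multiple vector
     registers would then not be penalized.  */
  if (nregs > 1
      && targetm.secondary_memory_needed (TYPE_MODE (vectype), ALL_REGS,
					  REGNO_REG_CLASS (REGNO (reg))))
    {
      /* ... add the spill-store and piecewise-load costs
	 as in the patch above ...  */
    }

The conservative behaviour of the hook for ALL_REGS is what makes this
safe: it can only err towards including the cost.)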
> > And we'd even like to penalize the non-memory case because going from
> >
> > pair:
> >         lea    (%rdi,%rdi,1),%eax
> >         sar    %edi
> >         movslq %edi,%rdi
> >         cltq
> >         mov    %rax,-0x18(%rsp)
> >         movq   -0x18(%rsp),%xmm0
> >         mov    %rdi,-0x18(%rsp)
> >         movhps -0x18(%rsp),%xmm0
> >         movaps %xmm0,-0x18(%rsp)
> >         mov    -0x18(%rsp),%rax
> >         mov    -0x10(%rsp),%rdx
> >         retq
> >
> > to
> >
> > pair:
> >         lea    (%rdi,%rdi,1),%eax
> >         sar    %edi
> >         movslq %edi,%rdi
> >         cltq
> >         mov    %rax,-0x18(%rsp)
> >         movq   -0x18(%rsp),%xmm0
> >         mov    %rdi,-0x18(%rsp)
> >         movhps -0x18(%rsp),%xmm0
> >         movq   %xmm0,%rax
> >         pextrq $1,%xmm0,%rdx
> >         retq
> >
> > is still worse than not vectorizing that copy (just in
> > case targetm.secondary_memory_needed says false which
> > it might eventually do because we strictly do not need
> > secondary memory here).
>
> But isn't the vector construction the expensive part of that?
> I assume we wouldn't want it even if %xmm0 was stored directly
> to memory rather than returned in GPRs.
>
> > Quoting the x86 implementation of the hook it seems we're
> > "saved" by the mode size check only:
> >
> >   if (SSE_CLASS_P (class1) != SSE_CLASS_P (class2))
> >     {
> >       /* SSE1 doesn't have any direct moves from other classes.  */
> >       if (!TARGET_SSE2)
> > 	return true;
> >
> >       /* If the target says that inter-unit moves are more expensive
> > 	 than moving through memory, then don't generate them.  */
> >       if ((SSE_CLASS_P (class1) && !TARGET_INTER_UNIT_MOVES_FROM_VEC)
> > 	  || (SSE_CLASS_P (class2) && !TARGET_INTER_UNIT_MOVES_TO_VEC))
> > 	return true;
> >
> >       /* Between SSE and general, we have moves no larger than word size.  */
> >       if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
> > 	return true;
>
> Yeah.
>
> > Your case where a target returns in multiple vector registers is one
> > not well handled by the patch in general - there isn't any attempt
> > at looking at the shape of the RESULT_DECL, we just see if a
> > vector store goes somewhere into it.  And we do this when processing
> > each store.  As said this is mostly a hack that covers the cases
> > the issue was reported for and shouldn't penalize most other
> > cases.  Yes, it might pessimize a target that returns large
> > aggregates in multiple vector registers, but is there any?
>
> SVE does this in some cases, but I guess none of them are relevant here.
>
> > So, may I go with the original patch?
>
> Still feels like we're counting spills on targets that shouldn't need them.
> But going back to:
>
>   /* Assume that a reg-reg move is possible and cheap,
>      do not account for vector to gp register move cost.  */
>
> I guess we could gloss over the "unnecessary" spill cost by saying that,
> even if the spill isn't needed, this is a conservative estimate of the
> vector to gp register move cost?

Yes.  What I really tried to ask is - am I going to need the vectorized
result piecewise in the end (but not being in control of the chopping
into pieces)?  I wanted to pessimize that with an estimate of the
"chopping cost".  I probably shouldn't have talked about spilling but
that's the usual straightforward solution of extracting something from
a larger register that works everywhere.

So I guess I'll update the comment and install as-is?  I still hope
for a better solution either on the target or the RTL optimization
side (undoing the vectorization).

Richard.
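(For reference, the pair function in the assembly above corresponds to
source of roughly this shape -- reconstructed from the assembly; the
actual gcc.target/i386/pr84101.c testcase may differ in details:

  typedef struct uint64_pair { unsigned long w0, w1; } uint64_pair_t;

  uint64_pair_t
  pair (int num)
  {
    uint64_pair_t p;
    p.w0 = num << 1;	/* the lea/cltq above */
    p.w1 = num >> 1;	/* the sar/movslq above */
    return p;		/* returned in RAX:RDX per the x86_64 psABI */
  }

i.e. a two-word aggregate built from scalars and returned in two GPRs,
which is exactly the shape the cost-model penalty is meant to catch.)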