On Wed, 3 Apr 2019, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Mon, 1 Apr 2019, Richard Sandiford wrote:
> >
> >> Richard Biener <rguent...@suse.de> writes:
> >> > This is an update of last year's attempt to tame down vectorization
> >> > when it runs into ABI inefficiencies at function return.  I've
> >> > updated the patch to look for multi-reg returns instead of
> >> > !vector ones (due to word_mode vectorization) and handle a few
> >> > more return forms, esp. the common pattern of
> >> >
> >> >   ret = vector;
> >> >   D.1234 = ret;
> >> >   ret = {CLOBBER};
> >> >   return D.1234;
> >> >
> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?
> >> >
> >> > I know this isn't the ultimate solution but we keep getting
> >> > duplicates of the PR.
> >> >
> >> > Richard.
> >> >
> >> > 2019-03-28  Richard Biener  <rguent...@suse.de>
> >> >
> >> >  PR tree-optimization/84101
> >> >  * tree-vect-stmts.c: Include explow.h for hard_function_value,
> >> >  regs.h for hard_regno_nregs.
> >> >  (cfun_returns): New helper.
> >> >  (vect_model_store_cost): When vectorizing a store to a decl
> >> >  that we return and the function ABI returns in a multi-reg
> >> >  location, account for the possible spilling that will happen.
> >> >
> >> >  * gcc.target/i386/pr84101.c: New testcase.
> >> >
> >> > Index: gcc/tree-vect-stmts.c
> >> > ===================================================================
> >> > --- gcc/tree-vect-stmts.c        (revision 269987)
> >> > +++ gcc/tree-vect-stmts.c        (working copy)
> >> > @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.
> >> >  #include "tree-cfg.h"
> >> >  #include "tree-ssa-loop-manip.h"
> >> >  #include "cfgloop.h"
> >> > +#include "explow.h"
> >> >  #include "tree-ssa-loop.h"
> >> >  #include "tree-scalar-evolution.h"
> >> >  #include "tree-vectorizer.h"
> >> > @@ -52,6 +53,7 @@ along with GCC; see the file COPYING3.
> >> >  #include "vec-perm-indices.h"
> >> >  #include "tree-ssa-loop-niter.h"
> >> >  #include "gimple-fold.h"
> >> > +#include "regs.h"
> >> >  
> >> >  /* For lang_hooks.types.type_for_mode.  */
> >> >  #include "langhooks.h"
> >> > @@ -948,6 +950,37 @@ vect_model_promotion_demotion_cost (stmt
> >> >                       "prologue_cost = %d .\n", inside_cost, prologue_cost);
> >> >  }
> >> >  
> >> > +/* Returns true if the current function returns DECL.  */
> >> > +
> >> > +static bool
> >> > +cfun_returns (tree decl)
> >> > +{
> >> > +  edge_iterator ei;
> >> > +  edge e;
> >> > +  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
> >> > +    {
> >> > +      greturn *ret = safe_dyn_cast <greturn *> (last_stmt (e->src));
> >> > +      if (!ret)
> >> > +        continue;
> >> > +      if (gimple_return_retval (ret) == decl)
> >> > +        return true;
> >> > +      /* We often end up with an aggregate copy to the result decl;
> >> > +         handle that case as well.  First skip intermediate clobbers
> >> > +         though.  */
> >> > +      gimple *def = ret;
> >> > +      do
> >> > +        {
> >> > +          def = SSA_NAME_DEF_STMT (gimple_vuse (def));
> >> > +        }
> >> > +      while (gimple_clobber_p (def));
> >> > +      if (is_a <gassign *> (def)
> >> > +          && gimple_assign_lhs (def) == gimple_return_retval (ret)
> >> > +          && gimple_assign_rhs1 (def) == decl)
> >> > +        return true;
> >> > +    }
> >> > +  return false;
> >> > +}
> >> > +
> >> >  /* Function vect_model_store_cost
> >> >  
> >> >     Models cost for stores.  In the case of grouped accesses, one access
> >> > @@ -1032,6 +1065,37 @@ vect_model_store_cost (stmt_vec_info stm
> >> >                                         vec_to_scalar, stmt_info, 0, vect_body);
> >> >      }
> >> >  
> >> > +  /* When vectorizing a store into the function result, assign
> >> > +     a penalty if the function returns in a multi-register location.
> >> > +     In this case we assume we'll end up having to spill the
> >> > +     vector result and do piecewise loads.  */
> >> > +  tree base = get_base_address (STMT_VINFO_DATA_REF (stmt_info)->ref);
> >> > +  if (base
> >> > +      && (TREE_CODE (base) == RESULT_DECL
> >> > +          || (DECL_P (base) && cfun_returns (base)))
> >> > +      && !aggregate_value_p (base, cfun->decl))
> >> > +    {
> >> > +      rtx reg = hard_function_value (TREE_TYPE (base), cfun->decl, 0, 1);
> >> > +      /* ???  Handle PARALLEL in some way.  */
> >> > +      if (REG_P (reg))
> >> > +        {
> >> > +          int nregs = hard_regno_nregs (REGNO (reg), GET_MODE (reg));
> >> > +          /* Assume that a reg-reg move is possible and cheap,
> >> > +             do not account for vector to gp register move cost.  */
> >> > +          if (nregs > 1)
> >> 
> >> Looks OK to me FWIW, but maybe it would be worth having a check like:
> >> 
> >>     targetm.secondary_memory_needed (TYPE_MODE (vectype), ALL_REGS,
> >>                                 REGNO_REG_CLASS (REGNO (reg)))
> >> 
> >> as well as the above, so that we don't accidentally penalise values
> >> that are returned in multiple vector registers.  Looks like the i386.c
> >> definition will return true in the cases that matter.
> >
> > I wonder if what this hook returns makes sense if you feed it modes
> > with different sizes?  Say for the testcase V2DImode and
> > the general regs class (DImode).
> 
> In this case I think it's between V2DImode and TImode.  I went with
> the vector mode because ABIs are the one place where BLKmode registers
> are a thing.  Guess that's not a problem here though, because nregs > 1
> would be false for BLKmode.  So yeah, GET_MODE (reg) should be fine too.

Ah, OK.  Though then the target would see TImode, ALL_REGS, general-regs
which isn't what we want to ask...  Leaves us with V2DImode, ALL_REGS,
general-regs then.
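
Concretely, something like this on top of the patch (untested sketch;
vectype here would be STMT_VINFO_VECTYPE (stmt_info)):

	  int nregs = hard_regno_nregs (REGNO (reg), GET_MODE (reg));
	  /* Only penalize multi-reg returns if moving between the
	     vector registers and the return register class has to
	     go through memory; ALL_REGS errs on the side of true.  */
	  if (nregs > 1
	      && targetm.secondary_memory_needed (TYPE_MODE (vectype),
						  ALL_REGS,
						  REGNO_REG_CLASS (REGNO (reg))))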

> > No "direct" moves exist because the
> > result doesn't fit anyway, so I wonder about the semantics of
> > secondary_memory_needed here.  Wouldn't it be more appropriate to
> > specify the correct register class instead of ALL_REGS here and
> > choose GET_MODE (reg) for the mode instead?  Because in the end
> > we'll do N GET_MODE (reg) moves from the vector register class to
> > the reg regclass?  OTOH I have no idea how to get at the register
> > class of 'vectype' ...?
> 
> I think we only end up asking this question if the target lets both
> register classes hold as many bytes as are in the mode.  It might need
> multiple registers, but that's OK.
> 
> And yeah, it would be nice to be more precise than ALL_REGS, but I don't
> think there's a reliable way of getting the "preferred" register class
> for a mode in this context.
> 
> The hook is supposed to be conservatively correct for ALL_REGS,
> so the above should err on the side of including the cost.

OK, I see.
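
(For reference, if I read target.def correctly the hook is

	bool secondary_memory_needed (machine_mode mode, reg_class_t class1,
				      reg_class_t class2);

so your suggested call passes the vector mode plus the two classes,
with ALL_REGS being the conservative choice as you say.)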

> > But yes, something like that might be an improvement.  Still
> > x86 _can_ do direct moves between xmm and general regs
> > (some tunings prefer to go through memory) but the issue is
> > really that the RTL expander will emit code that later the
> > register allocator is not able to mangle into a form w/o
> > spills.  The issue there is really the upper half extracts
> > of the vector register.  So I'm not so sure that
> > targetm.secondary_memory_needed is the correct tool to use
> > for this heuristic - it is after all an RTL "optimization"
> > issue we are working around.
> 
> But the moves do seem to be "proper" 128-bit moves:
> 
> (insn 11 10 12 2 (set (reg:TI 87 [ D.1914 ])
>         (subreg:TI (reg:V2DI 90) 0)) "/tmp/f2.c":18:10 -1
>      (nil))
> (insn 12 11 16 2 (set (reg:TI 88 [ <retval> ])
>         (reg:TI 87 [ D.1914 ])) "/tmp/f2.c":18:10 -1
>      (nil))
> (insn 16 12 17 2 (set (reg/i:TI 0 ax)
>         (reg:TI 88 [ <retval> ])) "/tmp/f2.c":19:1 -1
>      (nil))
> 
> which eventually becomes:
> 
> (insn 16 10 17 2 (set (reg/i:TI 0 ax)
>         (subreg:TI (reg:V2DI 90) 0)) "/tmp/f2.c":19:1 65 {*movti_internal}
>      (expr_list:REG_DEAD (reg:V2DI 90)
>         (nil)))
> 
> So I don't think it's an RTL optimisation problem as such.  Some targets
> (like AArch64) can do the equivalent of this without spills in either
> direction.  Similarly, MIPS32 can use mthc1/mtc1 and mfhc1/mfc1 for
> moving 64-bit FPRs to and from 32-bit GPRs.
> 
> So I think it really is up to the target to decide whether it can/wants
> to do this without spills or not.  It seems wrong to cost in a spill
> that should never happen. :-)

True.  But as you say below it's only half of the story since
the vec_concat will still happen (and that is also costly, even
more so if going through memory due to store-to-load forwarding
(STLF) issues).
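
For context, the function in question is roughly this (my sketch,
reconstructed from the asm further down, not the exact
gcc.target/i386/pr84101.c):

	struct pair { long a; long b; };

	struct pair
	pair (int x)
	{
	  struct pair p;
	  p.a = x << 1;	/* the lea below */
	  p.b = x >> 1;	/* the sar plus sign-extend below */
	  return p;	/* two DImode regs, %rax and %rdx */
	}

The V2DI store into p is what the vectorizer turns into the
vec_concat.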

> If there is a case that needs spills, secondary_memory_needed has to
> return true for that case.  So I think it's the right hook to test.
> 
> > And we'd even like to penalize the non-memory case because going from
> >
> >   pair:
> >    lea    (%rdi,%rdi,1),%eax
> >    sar    %edi
> >    movslq %edi,%rdi
> >    cltq   
> >    mov    %rax,-0x18(%rsp)
> >    movq   -0x18(%rsp),%xmm0
> >    mov    %rdi,-0x18(%rsp)
> >    movhps -0x18(%rsp),%xmm0
> >    movaps %xmm0,-0x18(%rsp)
> >    mov    -0x18(%rsp),%rax
> >    mov    -0x10(%rsp),%rdx
> >    retq
> >
> > to
> >
> >   pair:
> >    lea    (%rdi,%rdi,1),%eax
> >    sar    %edi
> >    movslq %edi,%rdi
> >    cltq   
> >    mov    %rax,-0x18(%rsp)
> >    movq   -0x18(%rsp),%xmm0
> >    mov    %rdi,-0x18(%rsp)
> >    movhps -0x18(%rsp),%xmm0
> >    movlps %xmm0, %rdx
> >    movhps %xmm0, %rax
> >    retq
> >
> > is still worse than not vectorizing that copy (just in
> > case targetm.secondary_memory_needed says false, which
> > it might eventually do because we strictly do not need
> > secondary memory here).
> 
> But isn't the vector construction the expensive part of that?
> I assume we wouldn't want it even if %xmm0 was stored directly
> to memory rather than returned in GPRs.
> 
> > Quoting the x86 implementation of the hook it seems we're
> > "saved" by the mode size check only:
> >
> >   if (SSE_CLASS_P (class1) != SSE_CLASS_P (class2))
> >     {
> >       /* SSE1 doesn't have any direct moves from other classes.  */
> >       if (!TARGET_SSE2)
> >         return true;
> >
> >       /* If the target says that inter-unit moves are more expensive
> >          than moving through memory, then don't generate them.  */
> >       if ((SSE_CLASS_P (class1) && !TARGET_INTER_UNIT_MOVES_FROM_VEC)
> >           || (SSE_CLASS_P (class2) && !TARGET_INTER_UNIT_MOVES_TO_VEC))
> >         return true;
> >
> >       /* Between SSE and general, we have moves no larger than word size.  */
> >       if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
> >         return true;
> 
> Yeah.
> 
> > Your case where a target returns in multiple vector registers is one
> > not well handled by the patch in general - there isn't any attempt
> > at looking at the shape of the RESULT_DECL, we just see if a
> > vector store goes somewhere into it.  And we do this when processing
> > each store.  As said this is mostly a hack that covers the cases
> > the issue was reported for and shouldn't penalize most other
> > cases.  Yes, it might pessimize a target that returns large
> > aggregates in multiple vector registers, but is there any?
> 
> SVE does this in some cases, but I guess none of them are relevant here.
> 
> > So, may I go with the original patch?
> 
> Still feels like we're counting spills on targets that shouldn't need them.
> But going back to:
> 
>         /* Assume that a reg-reg move is possible and cheap,
>            do not account for vector to gp register move cost.  */
> 
> I guess we could gloss over the "unnecessary" spill cost by saying that,
> even if the spill isn't needed, this is a conservative estimate of the
> vector to gp register move cost?

Yes.  What I really tried to ask is: am I going to need the
vectorized result piecewise in the end (while not being in control
of the chopping into pieces)?  I wanted to pessimize that with
an estimate of the "chopping cost".  I probably shouldn't have
talked about spilling, but that's the usual straightforward
solution for extracting something from a larger register that
works everywhere.

So I guess I'll update the comment and install as-is?
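
Something like this for the wording (sketch only):

	  /* Assume that a reg-reg move is possible and cheap.  Even
	     when no spill is strictly necessary this conservatively
	     estimates the cost of chopping the vector result into
	     the pieces the multi-reg return location requires.  */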

I still hope for a better solution either on the target or the
RTL optimization side (undoing the vectorization).

Richard.
