Hi Richard,

    Thanks for the suggestions.

> -----Original Message-----
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Thursday, May 21, 2020 5:22 PM
> To: Yangfei (Felix) <felix.y...@huawei.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
> 
> "Yangfei (Felix)" <felix.y...@huawei.com> writes:
> > Hi,
> >
> >   I noticed a tiny SVE-related performance issue:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95254
> >
> >   For the given test case, SLP succeeds with VNx8HImode with or without
> option -msve-vector-bits=256.
> >   The root cause for the difference is that we choose a different mode in
> aarch64_vectorize_related_mode under -msve-vector-bits=256:
> VNx2HImode instead of V4HImode.
> >   Then in the final tree ssa forwprop pass, we need to do a VIEW_CONVERT
> from V4HImode to VNx2HImode.
> >
> >   The patch catches and simplifies the pattern in
> > aarch64_expand_sve_mem_move, emitting a mov pattern of V4HImode
> > instead.
> >   I am assuming endianness does not make a difference here considering
> this simplification.
> >   Bootstrap and tested on aarch64-linux-gnu.  Comments?
> 
> I think we should try to catch this at the gimple level, possibly during SLP 
> itself.

Agreed.  It would be better if this could be handled in SLP.
For the given test case, though, the difference only shows up after the final
ssa forwprop pass, so I guess it might be hard for SLP to catch it and
evaluate it in the vect cost model.

> Although the patch handles the case in which the V4HI is stored directly to
> memory, I assume it won't help if the code instead does:
> 
>     for (i = 0; i < 4; i++)
>       b[i] = u.a[i] + 1;

SLP fails if the loop is modified like that.

> targetm.can_change_mode_class (..., ALL_REGS) would be a good indicator
> of whether the needed VIEW_CONVERT_EXPR is cheap or expensive.
> 
> I agree it might still be worth handling this in the move patterns too.
> It feels like a target-independent optimisation though, and for example
> should also apply to V4HI moves involving subregs of VNx2HIs.
> 
> So I think it would be worth trying to do this in emit_move_insn.
> In principle it would be useful for:
> 
>   // M1 and M2 equal size, !targetm.can_change_mode_class (M1, M2,
> ALL_REGS)
> 
>   (set (subreg:M1 (reg:M2 ...)) (subreg:M1 (reg:M2 ...)))
>   (set (subreg:M1 (reg:M2 ...)) (mem:M1 ADDR))
>   (set (mem:M1 ADDR) (subreg:M1 (reg:M2 ...)))
>   (set (subreg:M1 (reg:M2 ...)) (constant C))
> 
> It would be nice if we could do this even without the
> can_change_mode_class condition, provided that it doesn't turn valid M1
> constants or MEMs into invalid M2 ones (or at least, M2 ones that need to be
> validated).
> Unfortunately, going that far is likely to interfere with target-specific 
> choices,
> so it's probably too dangerous.
> 
> With the can_change_mode_class condition it should be fine though, since
> it's avoiding an implicit round trip through memory.  The change should be a
> win even if the MEMs or constants need legitimising for M2.

Good suggestion.  I have worked out a v2 patch with the logic moved into
emit_move_insn.
For the (set (subreg:M1 (reg:M2 ...)) (constant C)) case, I simply require
that C also be a legitimate constant for M2.
It works for the test case.  I will do a full test of the v2 patch if it
looks OK to you.
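For reference, the shape of such a check in emit_move_insn might look
roughly like the following.  This is a non-compiling sketch against GCC
internals, not the actual v2 patch (which is in the attachment); the exact
placement, predicates, and the handling of the MEM and constant cases are
assumptions.

```c
/* Sketch only: early in emit_move_insn, for the reg<->reg case.
   If both sides are subregs of registers in the same equal-sized mode
   M2, and the target says changing between M1 and M2 is not free for
   any register class, do the move in M2 instead, avoiding the implicit
   round trip through memory.  */
if (SUBREG_P (x) && SUBREG_P (y)
    && GET_MODE (SUBREG_REG (x)) == GET_MODE (SUBREG_REG (y))
    && known_eq (GET_MODE_SIZE (GET_MODE (SUBREG_REG (x))),
		 GET_MODE_SIZE (mode))
    && !targetm.can_change_mode_class (mode,
				       GET_MODE (SUBREG_REG (x)),
				       ALL_REGS))
  return emit_move_insn (SUBREG_REG (x), SUBREG_REG (y));
```

The MEM and constant variants listed above would need the analogous
re-expression in M2, plus legitimisation of the address or constant for the
new mode.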

> > [...]
> > +  if (inner != NULL_RTX
> > +      && aarch64_classify_vector_mode (inner_mode) == VEC_ADVSIMD
> > +      && GET_MODE_INNER (mode) == GET_MODE_INNER (inner_mode)
> > +      && known_eq (GET_MODE_SIZE (mode), GET_MODE_SIZE
> (inner_mode))
> > +      && GET_MODE_BITSIZE (inner_mode).to_constant () <= alignment)
> > +    {
> > +      rtx addr, mem;
> > +      if (MEM_P (src))
> > +   {
> > +     addr = XEXP (src, 0);
> > +     mem = gen_rtx_MEM (inner_mode, addr);
> > +     emit_move_insn (inner, mem);
> > +   }
> > +      else
> > +   {
> > +     addr = XEXP (dest, 0);
> > +     mem = gen_rtx_MEM (inner_mode, addr);
> > +     emit_move_insn (mem, inner);
> 
> gen_rtx_MEM shouldn't be used to create new versions of existing MEMs,
> since it drops all the attributes.  It's better to use something like
> adjust_address instead.  That will also take care of making sure that the
> address is valid for the new mode.

My fault, I should have noticed that difference.  :-)
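Concretely, the advice above amounts to replacing the gen_rtx_MEM calls
with something along these lines (sketch only, against GCC internals; the
real change is in the attached v2 patch):

```c
/* adjust_address reuses the original MEM's attributes (alias set,
   alignment, etc.) and legitimises the address for the new mode,
   unlike gen_rtx_MEM, which drops them.  */
mem = adjust_address (src, inner_mode, 0);
emit_move_insn (inner, mem);
```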

Felix

Attachment: pr95254-v2.diff