> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Tuesday, August 31, 2021 4:14 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>; Marcus Shawcroft
> <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants
> and operations
> 
> Tamar Christina <tamar.christ...@arm.com> writes:
> > @@ -13936,8 +13937,65 @@ cost_plus:
> >                          mode, MULT, 1, speed);
> >            return true;
> >          }
> > +   break;
> > +    case PARALLEL:
> > +      /* Fall through */
> 
> Which code paths lead to getting a PARALLEL here?

Hi,

Thanks for the review!

I added it for completeness because CSE treats a parallel and CONST_VECTOR as
equivalent when they each entry in the parallel defines a constant.

> 
> > +    case CONST_VECTOR:
> > +   {
> > +     rtx gen_insn = aarch64_simd_make_constant (x, true);
> > +     /* Not a valid const vector.  */
> > +     if (!gen_insn)
> > +       break;
> >
> > -      /* Fall through.  */
> > +     switch (GET_CODE (gen_insn))
> > +     {
> > +     case CONST_VECTOR:
> > +       /* Load using MOVI/MVNI.  */
> > +       if (aarch64_simd_valid_immediate (x, NULL))
> > +         *cost += extra_cost->vect.movi;
> > +       else /* Load using constant pool.  */
> > +         *cost += extra_cost->ldst.load;
> > +       break;
> > +     /* Load using a DUP.  */
> > +     case VEC_DUPLICATE:
> > +       *cost += extra_cost->vect.dup;
> > +       break;
> 
> Does this trigger in practice?  The new check==true path (rightly) stops the
> duplicated element from being forced into a register, but then I would have
> expected:
> 
> rtx
> gen_vec_duplicate (machine_mode mode, rtx x) {
>   if (valid_for_const_vector_p (mode, x))
>     return gen_const_vec_duplicate (mode, x);
>   return gen_rtx_VEC_DUPLICATE (mode, x); }
> 
> to generate the original CONST_VECTOR again.

Yes, but CSE is trying to see whether using a DUP is cheaper than another 
instruction.
Normal code won't hit this but CSE is just costing all the different ways one 
can semantically
construct a vector, which RTL actually comes out of it depends on how it's 
folded as you say.

> 
> > +     default:
> > +       *cost += extra_cost->ldst.load;
> > +       break;
> > +     }
> > +     return true;
> > +   }
> > +    case VEC_CONCAT:
> > +   /* depending on the operation, either DUP or INS.
> > +      For now, keep default costing.  */
> > +   break;
> > +    case VEC_DUPLICATE:
> > +   *cost += extra_cost->vect.dup;
> > +   return true;
> > +    case VEC_SELECT:
> > +   {
> > +     /* cost subreg of 0 as free, otherwise as DUP */
> > +     rtx op1 = XEXP (x, 1);
> > +     int nelts;
> > +     if ((op1 == const0_rtx && !BYTES_BIG_ENDIAN)
> > +         || (BYTES_BIG_ENDIAN
> > +             && GET_MODE_NUNITS (mode).is_constant(&nelts)
> > +             && INTVAL (op1) == nelts - 1))
> > +       ;
> > +     else if (vec_series_lowpart_p (mode, GET_MODE (op1), op1))
> > +       ;
> > +     else if (vec_series_highpart_p (mode, GET_MODE (op1), op1))
> > +     /* Selecting the high part is not technically free, but we lack
> > +        enough information to decide that here.  For instance selecting
> > +        the high-part of a vec_dup *is* free or to feed into any _high
> > +        instruction.   Both of which we can't really tell.  That said
> > +        have a better chance to optimize an dup vs multiple constants.  */
> > +       ;
> 
> Not sure about this.  We already try to detect the latter case (_high
> instructions) via aarch64_strip_extend_vec_half.  We might be missing some
> cases, but that still feels like the right way to go IMO.

That's a different problem from what I understand.  What this is trying to say 
is that
If you have a vector [x y a b] and you need vector [x y] that you can use the 
top part
of the original vector for this.

This is an approximation, because something that can be created with a movi is 
probably
Cheaper to keep distinct if it's not going to be paired with a _high operation 
(since you will have a dup then).

The problem is that the front end has already spit the two Vectors into [x y a 
b] and [x y].
There's nothing else that tries to consolidate them back up if both survive.

As a consequence of this, the testcase test0 is not handled optimally.  It 
would instead create
2 vectors, both of movi 0x3, just one being 64-bits and one being 128-bits.

So if the cost of selecting it is cheaper than the movi, cse will not 
consolidate the vectors,
and because movi's are so cheap, the only cost that worked was 0.  But 
increasing the costs
of movi's requires the costs of everything to be increased (including loads).

I preferred to 0 out the cost, because the worst that can happen is an dup 
instead of a movi,
And at best a dup instead of a load from a pool (if the constant is 
complicated).

> 
> Selecting the high part of a vec_dup should get folded into another vec_dup.
> 
> The lowpart bits look OK, but which paths call this function without first
> simplifying the select to a subreg?  The subreg is now the canonical form
> (thanks to r12-2288).

The simplification will happen during folding in cse or in combine.  This 
costing happens before the folding,
When CSE is trying to decide whether to undo the front end's lowering of 
constants.

To do so it models the constants and the semantic operation required to extract 
them. E.g. to get
2 out of [0 2 4 5] it would need a VEC_SELECT of 1. And I don't treat the first 
element/bottom part special
Here.  Costing wise they would be the same.

Regards,
Tamar
> 
> Thanks,
> Richard

Reply via email to