Tamar Christina <tamar.christ...@arm.com> writes:
> Hi Richard,
> The 06/08/2020 16:42, Richard Sandiford wrote:
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> > index 
>> > 97da60762390db81df9cffaf316b909cd1609130..9cc8da338125afa01bc9fb645f4112d2d7ef548c
>> >  100644
>> > --- a/gcc/config/aarch64/aarch64.c
>> > +++ b/gcc/config/aarch64/aarch64.c
>> > @@ -11279,6 +11279,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, 
>> > int outer, bool speed)
>> >    if (VECTOR_MODE_P (mode))
>> >      mode = GET_MODE_INNER (mode);
>> > 
>> > +  /* The by element versions of the instruction has the same costs as the
>> > +     normal 3 vector version.  So don't add the costs of the duplicate 
>> > into
>> > +     the costs of the multiply.  */
>> > +  if (GET_CODE (op0) == VEC_DUPLICATE)
>> > +    op0 = XEXP (op0, 0);
>> > +  else if (GET_CODE (op1) == VEC_DUPLICATE)
>> > +    op1 = XEXP (op1, 0);
>> > +
>> >    /* Integer multiply/fma.  */
>> >    if (GET_MODE_CLASS (mode) == MODE_INT)
>> >      {
>> 
>> SVE doesn't have duplicating forms, so I think we should put this code
>> under the “if (VECTOR_MODE_P (mode))” condition, before changing “mode”,
>> and then restrict it to VEC_ADVSIMD modes.
>> 
>> (SVE FMUL does have an indexed form, but the index is relative to the
>> start of the associated quadword, so it isn't a VEC_DUPLICATE.)
>>
>
> Done, I have updated the patch. (See attached)
>  
>> I guess there's a danger that this could underestimate the cost for
>> integer modes, if the scalar integer input needs to be moved from GPRs.
>> In that case the cost of a MULT + VEC_DUPLICATE is probably more
>> accurate, even though it's still one instruction before RA.
>> 
>> But I guess there's no perfect answer there.  The new code will be
>> right for integer modes in some cases and not in others.  Same if
>> we leave things as they are.  But maybe it'd be worth having a comment
>> to say that we're assuming the best case, i.e. that the duplicated
>> value is naturally in FPRs?
>> 
>
> Hmm I haven't added the comment yet since I don't fully understand when the
> integer case would be misleading.
>
> In both cases the cost for the GPR is paid by the MOV no? I'm missing
> why having the MUL account for it would be better in some cases.

The point was that any MOV isn't exposed until after register allocation,
whereas costs are usually applied before then.  So before RA:

> For instance for the integer case we used to generate
>
>         dup     v0.4s, w2
>         mul     v2.4s, v2.4s, v0.4s

...this was costed as:

   (set (reg:V4SI R2) (vec_duplicate:V4SI (reg:SI R1)))
   (set (reg:V4SI R3) (mult:V4SI ...))

and so accurate when R1 naturally ends up in a GPR.

> but now do
>
>         fmov    s0, w2
>         mul     v2.4s, v2.4s, v0.s[0]

...and this is costed as:

   (set (reg:V4SI R3) (mult:V4SI ...))

and so accurate when R1 naturally ends up in an FPR (without needing
a reload to put it there).

In other words, before RA, the patch is making the optimistic assumption
that R1 is already in FPRs and so a separate FMOV won't be needed.

Thanks,
Richard

> Which is better on older cores such Cortex-A55 and no different on newer 
> cores such as
> Cortex-A76 according to the optimization guides.
>
> Regards,
> Tamar
>
>> Thanks,
>> Richard

Reply via email to