Re: [PATCH/AARCH64] Improve/correct ThunderX 1 cost model for Arith_shift

James Greenhalgh Wed, 21 Jun 2017 04:14:05 -0700

On Tue, Jun 20, 2017 at 02:07:22PM -0700, Andrew Pinski wrote:
> On Mon, Jun 19, 2017 at 2:00 PM, Andrew Pinski <pins...@gmail.com> wrote:
> > On Wed, Jun 7, 2017 at 10:16 AM, James Greenhalgh
> > <james.greenha...@arm.com> wrote:
> >> On Fri, Dec 30, 2016 at 10:05:26PM -0800, Andrew Pinski wrote:
> >>> Hi,
> >>>   Currently for the following function:
> >>> int f(int a, int b)
> >>> {
> >>>   return a + (b <<7);
> >>> }
> >>>
> >>> GCC produces:
> >>> add     w0, w0, w1, lsl 7
> >>> But for ThunderX 1, it is better if the instruction is split, which
> >>> allows better scheduling in most cases; the latency is the same.  I
> >>> get a small improvement in CoreMark, ~1%.
> >>>
> >>> Currently the code does not take into account Arith_shift even though
> >>> the comment:
> >>>   /* Strip any extend, leave shifts behind as we will
> >>>     cost them through mult_cost.  */
> >>> says it does not strip out the shift; aarch64_strip_extend does, and
> >>> always has since the back-end was added to GCC.
> >>>
> >>> Once I fixed the code around aarch64_strip_extend, I got a regression
> >>> for ThunderX 1 as some shifts/extends (left shifts <=4 and/or zero
> >>> extends) are considered free so I needed to add a new tuning flag.
> >>>
> >>> Note I will get an even bigger improvement for ThunderX 2 CN99XX, but I
> >>> have not measured it yet as I have not made the change to
> >>> aarch64-cost-tables.h yet as I am waiting for approval of the renaming
> >>> patch first before submitting any of the cost table changes.  Also I
> >>> noticed this problem with this tuning first and then looked back at
> >>> what I needed to do for ThunderX 1.
> >>>
> >>> OK?  Bootstrapped and tested on aarch64-linux-gnu without any
> >>> regressions (both with and without --with-cpu=thunderx).
> >>
> >> This is mostly OK, but I don't like the name "easy"_shift_extend. Cheap
> >> or free seems better. I have some other minor points below.
> >
> >
> > Ok, that seems like a good idea.  I used easy since that was the
> > wording our hardware folks had come up with.  I am changing the
> > comments to make it clearer when this flag should be used.
> > I should have a new patch out by the end of today.
> 
> Due to the LSE ICE which I reported in the other thread, it took me
> longer to send out a new patch.
> Anyways here is the updated patch with the changes requested.
> 
> 
> OK? Bootstrapped and tested on aarch64-linux-gnu with no regressions.


One grammar fix inline below, otherwise this is OK.

Thanks,
James

> * config/aarch64/aarch64-cost-tables.h (thunderx_extra_costs):
> Increment Arith_shift and Arith_shift_reg by 1.
> * config/aarch64/aarch64-tuning-flags.def (cheap_shift_extend): New tuning 
> flag.
> * config/aarch64/aarch64.c (thunderx_tunings): Enable
> AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND.
> (aarch64_strip_extend): Add new argument and test for it.
> (aarch64_cheap_mult_shift_p): New function.
> (aarch64_rtx_mult_cost): Call aarch64_cheap_mult_shift_p and don't add
> a cost if it is true.
> Update calls to aarch64_strip_extend.
> (aarch64_rtx_costs): Update calls to aarch64_strip_extend.
> 
> +
> +/* Return true iff X is an cheap shift without a sign extend. */

s/an cheap/a cheap/

> +
> +static bool
> +aarch64_cheap_mult_shift_p (rtx x)
> +{
> +  rtx op0, op1;
> +
> +  op0 = XEXP (x, 0);
> +  op1 = XEXP (x, 1);
> +
> +  if (!(aarch64_tune_params.extra_tuning_flags
> +                      & AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND))
> +    return false;
> +
> +  if (GET_CODE (op0) == SIGN_EXTEND)
> +    return false;
> +
> +  if (GET_CODE (x) == ASHIFT && CONST_INT_P (op1)
> +      && UINTVAL (op1) <= 4)
> +    return true;
> +
> +  if (GET_CODE (x) != MULT || !CONST_INT_P (op1))
> +    return false;
> +
> +  HOST_WIDE_INT l2 = exact_log2 (INTVAL (op1));
> +
> +  if (l2 > 0 && l2 <= 4)
> +    return true;
> +
> +  return false;
> +}
> +
>  /* Helper function for rtx cost calculation.  Calculate the cost of
>     a MULT or ASHIFT, which may be part of a compound PLUS/MINUS rtx.
>     Return the calculated cost of the expression, recursing manually in to
> @@ -6164,7 +6200,11 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_c
>           {
>             if (compound_p)
>               {
> -               if (REG_P (op1))
> +               /* If the shift is considered cheap,
> +                  then don't add any cost. */
> +               if (aarch64_cheap_mult_shift_p (x))
> +                 ;
> +               else if (REG_P (op1))
>                   /* ARITH + shift-by-register.  */
>                   cost += extra_cost->alu.arith_shift_reg;
>                 else if (is_extend)
> @@ -6182,7 +6222,7 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_c
>           }
>         /* Strip extends as we will have costed them in the case above.  */
>         if (is_extend)
> -         op0 = aarch64_strip_extend (op0);
> +         op0 = aarch64_strip_extend (op0, true);
>  
>         cost += rtx_cost (op0, VOIDmode, code, 0, speed);
>  
> @@ -7026,13 +7066,13 @@ cost_minus:
>           if (speed)
>             *cost += extra_cost->alu.extend_arith;
>  
> -         op1 = aarch64_strip_extend (op1);
> +         op1 = aarch64_strip_extend (op1, true);
>           *cost += rtx_cost (op1, VOIDmode,
>                              (enum rtx_code) GET_CODE (op1), 0, speed);
>           return true;
>         }
>  
> -     rtx new_op1 = aarch64_strip_extend (op1);
> +     rtx new_op1 = aarch64_strip_extend (op1, false);
>  
>       /* Cost this as an FMA-alike operation.  */
>       if ((GET_CODE (new_op1) == MULT
> @@ -7105,7 +7145,7 @@ cost_plus:
>           if (speed)
>             *cost += extra_cost->alu.extend_arith;
>  
> -         op0 = aarch64_strip_extend (op0);
> +         op0 = aarch64_strip_extend (op0, true);
>           *cost += rtx_cost (op0, VOIDmode,
>                              (enum rtx_code) GET_CODE (op0), 0, speed);
>           return true;
> @@ -7113,7 +7153,7 @@ cost_plus:
>  
>       /* Strip any extend, leave shifts behind as we will
>          cost them through mult_cost.  */
> -     new_op0 = aarch64_strip_extend (op0);
> +     new_op0 = aarch64_strip_extend (op0, false);
>  
>       if (GET_CODE (new_op0) == MULT
>           || aarch64_shift_p (GET_CODE (new_op0)))
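
For readers following along outside the GCC tree, the decision made by aarch64_cheap_mult_shift_p can be sketched as a standalone C model. This is only an illustration under stated assumptions, not the patch itself: the names op_kind, exact_log2_model and cheap_mult_shift_model are hypothetical, the tuning flag and the SIGN_EXTEND check are reduced to plain booleans, and the second operand is assumed to already be a constant integer (the real code checks CONST_INT_P first).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two RTL codes the patch inspects.  */
enum op_kind { OP_ASHIFT, OP_MULT };

/* Return log2 (v) when v is a positive power of two, otherwise -1.
   This models how the patch uses GCC's exact_log2.  */
static int
exact_log2_model (int64_t v)
{
  if (v <= 0 || (v & (v - 1)) != 0)
    return -1;

  int n = 0;
  while ((v >>= 1) != 0)
    n++;
  return n;
}

/* Model of the predicate: true when the shift (or the multiply by a
   power of two) would be treated as free under the cheap_shift_extend
   tuning flag.  AMOUNT is assumed to be a constant operand.  */
static bool
cheap_mult_shift_model (bool flag_enabled, bool op0_is_sign_extend,
                        enum op_kind kind, int64_t amount)
{
  /* Without the tuning flag nothing is considered cheap.  */
  if (!flag_enabled)
    return false;

  /* Sign extends are never treated as cheap.  */
  if (op0_is_sign_extend)
    return false;

  /* Left shift by a constant of at most 4 (unsigned compare, as with
     UINTVAL in the patch).  */
  if (kind == OP_ASHIFT)
    return (uint64_t) amount <= 4;

  /* Multiply by 2, 4, 8 or 16, i.e. a shift by 1 to 4.  */
  int l2 = exact_log2_model (amount);
  return l2 > 0 && l2 <= 4;
}
```

Under this model the shift by 7 from the original example is not cheap, so it picks up the Arith_shift cost, while a shift by 4 or a multiply by 16 is treated as free. Note that a shift by 0 counts as cheap but a multiply by 1 does not, which follows from the l2 > 0 test in the patch.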
