On Tue, Jun 20, 2017 at 02:07:22PM -0700, Andrew Pinski wrote:
> On Mon, Jun 19, 2017 at 2:00 PM, Andrew Pinski <pins...@gmail.com> wrote:
> > On Wed, Jun 7, 2017 at 10:16 AM, James Greenhalgh
> > <james.greenha...@arm.com> wrote:
> >> On Fri, Dec 30, 2016 at 10:05:26PM -0800, Andrew Pinski wrote:
> >>> Hi,
> >>> Currently for the following function:
> >>> int f(int a, int b)
> >>> {
> >>>   return a + (b <<7);
> >>> }
> >>>
> >>> GCC produces:
> >>> add w0, w0, w1, lsl 7
> >>> But for ThunderX 1, it is better if the instruction was split allowing
> >>> better scheduling to happen in most cases, the latency is the same. I
> >>> get a small improvement in coremarks, ~1%.
> >>>
> >>> Currently the code does not take into account Arith_shift even though
> >>> the comment:
> >>> /* Strip any extend, leave shifts behind as we will
> >>> cost them through mult_cost. */
> >>> Say it does not strip out the shift, aarch64_strip_extend does and has
> >>> always has since the back-end was added to GCC.
> >>>
> >>> Once I fixed the code around aarch64_strip_extend, I got a regression
> >>> for ThunderX 1 as some shifts/extends (left shifts <=4 and/or zero
> >>> extends) are considered free so I needed to add a new tuning flag.
> >>>
> >>> Note I will get an even more improvement for ThunderX 2 CN99XX, but I
> >>> have not measured it yet as I have not made the change to
> >>> aarch64-cost-tables.h yet as I am waiting for approval of the renaming
> >>> patch first before submitting any of the cost table changes. Also I
> >>> noticed this problem with this tuning first and then looked back at
> >>> what I needed to do for ThunderX 1.
> >>>
> >>> OK? Bootstrapped and tested on aarch64-linux-gnu without any
> >>> regressions (both with and without --with-cpu=thunderx).
> >>
> >> This is mostly OK, but I don't like the name "easy"_shift_extend. Cheap
> >> or free seems better. I have some other minor points below.
> >
> >
> > Ok, that seems like a good idea. I used easy since that was the
> > wording our hardware folks had came up with. I am changing the
> > comments to make clearer when this flag should be used.
> > I should a new patch out by the end of today.
>
> Due to the LSE ICE which I reported in the other thread, it took me
> longer to send out a new patch.
> Anyways here is the updated patch with the changes requested.
>
>
> OK? Bootstrapped and tested on aarch64-linux-gnu with no regressions.

One grammar fix inline below, otherwise this is OK.

Thanks,
James

> * config/aarch64/aarch64-cost-tables.h (thunderx_extra_costs):
> Increment Arith_shift and Arith_shift_reg by 1.
> * config/aarch64/aarch64-tuning-flags.def (cheap_shift_extend): New tuning
> flag.
> * config/aarch64/aarch64.c (thunderx_tunings): Enable
> AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND.
> (aarch64_strip_extend): Add new argument and test for it.
> (aarch64_cheap_mult_shift_p): New function.
> (aarch64_rtx_mult_cost): Call aarch64_cheap_mult_shift_p and don't add
> a cost if it is true.
> Update calls to aarch64_strip_extend.
> (aarch64_rtx_costs): Update calls to aarch64_strip_extend.
>
> +
> +/* Return true iff X is an cheap shift without a sign extend. */

s/an cheap/a cheap/

> +
> +static bool
> +aarch64_cheap_mult_shift_p (rtx x)
> +{
> +  rtx op0, op1;
> +
> +  op0 = XEXP (x, 0);
> +  op1 = XEXP (x, 1);
> +
> +  if (!(aarch64_tune_params.extra_tuning_flags
> +        & AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND))
> +    return false;
> +
> +  if (GET_CODE (op0) == SIGN_EXTEND)
> +    return false;
> +
> +  if (GET_CODE (x) == ASHIFT && CONST_INT_P (op1)
> +      && UINTVAL (op1) <= 4)
> +    return true;
> +
> +  if (GET_CODE (x) != MULT || !CONST_INT_P (op1))
> +    return false;
> +
> +  HOST_WIDE_INT l2 = exact_log2 (INTVAL (op1));
> +
> +  if (l2 > 0 && l2 <= 4)
> +    return true;
> +
> +  return false;
> +}
> +
>  /* Helper function for rtx cost calculation. Calculate the cost of
>     a MULT or ASHIFT, which may be part of a compound PLUS/MINUS rtx.
>     Return the calculated cost of the expression, recursing manually in to
> @@ -6164,7 +6200,11 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_c
>      {
>        if (compound_p)
>          {
> -          if (REG_P (op1))
> +          /* If the shift is considered cheap,
> +             then don't add any cost. */
> +          if (aarch64_cheap_mult_shift_p (x))
> +            ;
> +          else if (REG_P (op1))
>              /* ARITH + shift-by-register. */
>              cost += extra_cost->alu.arith_shift_reg;
>            else if (is_extend)
> @@ -6182,7 +6222,7 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_c
>      }
>    /* Strip extends as we will have costed them in the case above. */
>    if (is_extend)
> -    op0 = aarch64_strip_extend (op0);
> +    op0 = aarch64_strip_extend (op0, true);
>
>    cost += rtx_cost (op0, VOIDmode, code, 0, speed);
>
> @@ -7026,13 +7066,13 @@ cost_minus:
>            if (speed)
>              *cost += extra_cost->alu.extend_arith;
>
> -          op1 = aarch64_strip_extend (op1);
> +          op1 = aarch64_strip_extend (op1, true);
>            *cost += rtx_cost (op1, VOIDmode,
>                               (enum rtx_code) GET_CODE (op1), 0, speed);
>            return true;
>          }
>
> -        rtx new_op1 = aarch64_strip_extend (op1);
> +        rtx new_op1 = aarch64_strip_extend (op1, false);
>
>          /* Cost this as an FMA-alike operation. */
>          if ((GET_CODE (new_op1) == MULT
> @@ -7105,7 +7145,7 @@ cost_plus:
>            if (speed)
>              *cost += extra_cost->alu.extend_arith;
>
> -          op0 = aarch64_strip_extend (op0);
> +          op0 = aarch64_strip_extend (op0, true);
>            *cost += rtx_cost (op0, VOIDmode,
>                               (enum rtx_code) GET_CODE (op0), 0, speed);
>            return true;
> @@ -7113,7 +7153,7 @@ cost_plus:
>
>        /* Strip any extend, leave shifts behind as we will
>           cost them through mult_cost. */
> -      new_op0 = aarch64_strip_extend (op0);
> +      new_op0 = aarch64_strip_extend (op0, false);
>
>        if (GET_CODE (new_op0) == MULT
>            || aarch64_shift_p (GET_CODE (new_op0)))
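
For anyone who wants to poke at the new predicate outside of a GCC build, here is a minimal standalone sketch of the decision that aarch64_cheap_mult_shift_p makes. It is a model only, not GCC code: the enum op_code, struct shift_op, exact_log2_model and the cheap_shift_extend parameter are all hypothetical stand-ins for the rtx accessors, exact_log2 and the AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND tuning flag used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the rtx codes the patch inspects.  */
enum op_code { OP_ASHIFT, OP_MULT, OP_OTHER };

/* Hypothetical flattened view of the shift/multiply operand.  */
struct shift_op
{
  enum op_code code;     /* Outer operation (ASHIFT or MULT in the patch).  */
  bool op0_sign_extend;  /* True if operand 0 is a SIGN_EXTEND.  */
  bool op1_is_const;     /* True if operand 1 is a constant integer.  */
  int64_t op1;           /* Value of operand 1 when it is constant.  */
};

/* Model of an exact log2: returns log2 (V) when V is a positive power
   of two, and -1 otherwise.  Assumed to match what the patch relies on
   for the small multipliers of interest.  */
int
exact_log2_model (int64_t v)
{
  if (v <= 0 || (v & (v - 1)) != 0)
    return -1;
  int n = 0;
  while (v > 1)
    {
      v >>= 1;
      n++;
    }
  return n;
}

/* Model of the predicate in the patch.  CHEAP_SHIFT_EXTEND stands in
   for the tuning flag being set for the selected CPU.  */
bool
cheap_mult_shift_p (const struct shift_op *x, bool cheap_shift_extend)
{
  if (!cheap_shift_extend)
    return false;

  if (x->op0_sign_extend)
    return false;

  /* Constant left shift of at most 4.  */
  if (x->code == OP_ASHIFT && x->op1_is_const && (uint64_t) x->op1 <= 4)
    return true;

  if (x->code != OP_MULT || !x->op1_is_const)
    return false;

  /* Multiply by 2, 4, 8 or 16, i.e. a shift by 1 to 4.  */
  int l2 = exact_log2_model (x->op1);
  return l2 > 0 && l2 <= 4;
}

/* Usage: the b << 7 from the original example is not cheap, so it
   still picks up the Arith_shift cost; a shift by 4 is cheap.  */
void
cheap_shift_self_check (void)
{
  assert (!cheap_mult_shift_p (&(struct shift_op){ OP_ASHIFT, false, true, 7 }, true));
  assert (cheap_mult_shift_p (&(struct shift_op){ OP_ASHIFT, false, true, 4 }, true));
}
```

Under this model, the shift by 7 in f() is not considered cheap, so with the tuning flag set it is still charged the (now incremented) Arith_shift cost, which matches the motivation given at the top of the thread for splitting the instruction on ThunderX 1.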