On Wed, Apr 27, 2016 at 04:15:45PM -0500, Evandro Menezes wrote: > gcc/ > * config/aarch64/aarch64-protos.h > (aarch64_emit_approx_rsqrt): Replace with new function > "aarch64_emit_approx_sqrt". > (tune_params): New member "approx_sqrt_modes". > * config/aarch64/aarch64.c > (generic_tunings): New member "approx_rsqrt_modes". > (cortexa35_tunings): Likewise. > (cortexa53_tunings): Likewise. > (cortexa57_tunings): Likewise. > (cortexa72_tunings): Likewise. > (exynosm1_tunings): Likewise. > (thunderx_tunings): Likewise. > (xgene1_tunings): Likewise. > (aarch64_emit_approx_rsqrt): Replace with new function > "aarch64_emit_approx_sqrt". > (aarch64_override_options_after_change_1): Handle new option. > * config/aarch64/aarch64-simd.md > (rsqrt<mode>2): Use new function instead. > (sqrt<mode>2): New expansion and insn definitions. > * config/aarch64/aarch64.md: Likewise. > * config/aarch64/aarch64.opt > (mlow-precision-sqrt): Add new option description. > * doc/invoke.texi (mlow-precision-sqrt): Likewise. > > > diff --git a/gcc/config/aarch64/aarch64-protos.h > b/gcc/config/aarch64/aarch64-protos.h > index 50f1d24..437f6af 100644 > --- a/gcc/config/aarch64/aarch64-protos.h > +++ b/gcc/config/aarch64/aarch64-protos.h > @@ -244,6 +244,7 @@ struct tune_params > } autoprefetcher_model; > > unsigned int extra_tuning_flags; > + unsigned int approx_sqrt_modes; > unsigned int approx_rsqrt_modes; > };
This should go in struct recommended in 1/3 too. > > @@ -396,7 +397,7 @@ void aarch64_register_pragmas (void); > void aarch64_relayout_simd_types (void); > void aarch64_reset_previous_fndecl (void); > void aarch64_save_restore_target_globals (tree); > -void aarch64_emit_approx_rsqrt (rtx, rtx); > +bool aarch64_emit_approx_sqrt (rtx, rtx, bool); > > /* Initialize builtins for SIMD intrinsics. */ > void init_aarch64_simd_builtins (void); > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > index 68381bf..589871b 100644 > --- a/gcc/config/aarch64/aarch64.c > +++ b/gcc/config/aarch64/aarch64.c > @@ -38,6 +38,7 @@ > #include "recog.h" > #include "diagnostic.h" > #include "insn-attr.h" > +#include "insn-flags.h" Can you remember what you needed this include for? I couldn't spot it below and removing the include didn't seem to cause any trouble to the build. If it isn't needed, drop it. > #include "insn-modes.h" > #include "alias.h" > #include "fold-const.h" > @@ -7521,46 +7530,78 @@ get_rsqrts_type (machine_mode mode) > } > } > > -/* Emit instruction sequence to compute the reciprocal square root using the > - Newton-Raphson series. Iterate over the series twice for SF > - and thrice for DF. */ > +/* Emit instruction sequence to compute either the approximate square root > + or its approximate reciprocal. */ As you are updating this function a new parameter and a return value, please comment it appropriately. Describing the purpose of the parameter and the meaning of the return value. > -void > -aarch64_emit_approx_rsqrt (rtx dst, rtx src) > +bool > +aarch64_emit_approx_sqrt (rtx dst, rtx src, bool recp) > { > - machine_mode mode = GET_MODE (src); > - gcc_assert ( > - mode == SFmode || mode == V2SFmode || mode == V4SFmode > - || mode == DFmode || mode == V2DFmode); > - > - rtx xsrc = gen_reg_rtx (mode); > - emit_move_insn (xsrc, src); > - rtx x0 = gen_reg_rtx (mode); > + machine_mode mode = GET_MODE (dst); > + machine_mode mmsk = mode_for_vector (int_mode_for_mode (GET_MODE_INNER > (mode)), > + GET_MODE_NUNITS (mode)); > + > + if (!flag_finite_math_only > + || flag_trapping_math > + || !flag_unsafe_math_optimizations > + || optimize_function_for_size_p (cfun) > + || !((recp && (flag_mrecip_low_precision_sqrt > + || (aarch64_tune_params.approx_rsqrt_modes > + & AARCH64_APPROX_MODE (mode)))) > + || (!recp && (flag_mlow_precision_sqrt > + || (aarch64_tune_params.approx_sqrt_modes > + & AARCH64_APPROX_MODE (mode)))))) > + return false; Can you pull out these sub-clauses and assign them somewhere for clarity. These expressions are a bit too big to grok at a distance. I'd rather be reading: bool use_low_precision_rsqrt_p = recp && (flag_mrecip_low_precision_sqrt || (aarch64_tune_params.approx_rsqrt_modes & AARCH64_APPROX_MODE (mode))) <...> !((use_low_precision_sqrt_p) || (use_low_precision_rsqrt_p)) > - emit_insn ((*get_rsqrte_type (mode)) (x0, xsrc)); > + rtx xmsk = gen_reg_rtx (mmsk); > + if (!recp) > + /* When calculating the approximate square root, compare the argument > with > + 0.0 and create a mask. */ > + emit_insn (gen_rtx_SET (xmsk, gen_rtx_NEG (mmsk, gen_rtx_EQ (mmsk, src, > + CONST0_RTX (mode))))); > > - bool double_mode = (mode == DFmode || mode == V2DFmode); > + /* Estimate the approximate reciprocal square root. */ > + rtx xdst = gen_reg_rtx (mode); > + emit_insn ((*get_rsqrte_type (mode)) (xdst, src)); > > - int iterations = double_mode ? 3 : 2; > + /* Iterate over the series twice for SF and thrice for DF. */ > + int iterations = (GET_MODE_INNER (mode) == DFmode) ? 3 : 2; > > - /* Optionally iterate over the series one less time than otherwise. */ > - if (flag_mrecip_low_precision_sqrt) > + /* Optionally iterate over the series once less for faster performance > + while sacrificing the accuracy. */ > + if ((recp && flag_mrecip_low_precision_sqrt) > + || (!recp && flag_mlow_precision_sqrt)) > iterations--; > > - for (int i = 0; i < iterations; ++i) > + /* Iterate over the series to calculate the approximate reciprocal square > root. */ > + rtx x1 = gen_reg_rtx (mode); Why do you generate a register for x1 outside the loop but x2 inside it? Is it just because you want x1 to survive this scope? It seems you could either reuse a single x2 register too, or repeatedly generate a new x1 inside the loop too. > + while (iterations--) > { > - rtx x1 = gen_reg_rtx (mode); > rtx x2 = gen_reg_rtx (mode); > - rtx x3 = gen_reg_rtx (mode); > - emit_set_insn (x2, gen_rtx_MULT (mode, x0, x0)); > + emit_set_insn (x2, gen_rtx_MULT (mode, xdst, xdst)); > + > + emit_insn ((*get_rsqrts_type (mode)) (x1, src, x2)); > > - emit_insn ((*get_rsqrts_type (mode)) (x3, xsrc, x2)); > + if (iterations > 0) > + emit_set_insn (xdst, gen_rtx_MULT (mode, xdst, x1)); > + } > + > + if (!recp) > + { > + /* Qualify the approximate reciprocal square root when the argument is > + 0.0 by squashing the intermediary result to 0.0. */ > + rtx xtmp = gen_reg_rtx (mmsk); > + emit_set_insn (xtmp, gen_rtx_AND (mmsk, gen_rtx_NOT (mmsk, xmsk), > + gen_rtx_SUBREG (mmsk, xdst, 0))); > + emit_move_insn (xdst, gen_rtx_SUBREG (mode, xtmp, 0)); > > - emit_set_insn (x1, gen_rtx_MULT (mode, x0, x3)); > - x0 = x1; > + /* Calculate the approximate square root. */ > + emit_set_insn (xdst, gen_rtx_MULT (mode, xdst, src)); > } > > - emit_move_insn (dst, x0); > + /* Return the approximation. */ > + emit_set_insn (dst, gen_rtx_MULT (mode, xdst, x1)); > + > + return true; > } > > /* Return the number of instructions that can be issued per cycle. */ > @@ -8090,6 +8131,12 @@ aarch64_override_options_after_change_1 (struct > gcc_options *opts) > && (aarch64_cmodel == AARCH64_CMODEL_TINY > || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC)) > aarch64_nopcrelative_literal_loads = false; > + > + /* When enabling the lower precision Newton series for the square root, > also > + enable it for the reciprocal square root, since the later is an > + intermediary step for the latter. */ > + if (flag_mlow_precision_sqrt) > + flag_mrecip_low_precision_sqrt = true; > } > > /* 'Unpack' up the internal tuning structs and update the options > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md > index 9b282f1..aab3e00 100644 > --- a/gcc/config/aarch64/aarch64.md > +++ b/gcc/config/aarch64/aarch64.md > @@ -4683,7 +4683,16 @@ > [(set_attr "type" "ffarith<s>")] > ) > > -(define_insn "sqrt<mode>2" > +(define_expand "sqrt<mode>2" > + [(set (match_operand:GPF 0 "register_operand") > + (sqrt:GPF (match_operand:GPF 1 "register_operand")))] > + "TARGET_SIMD" > +{ > + if (aarch64_emit_approx_sqrt (operands[0], operands[1], false)) > + DONE; > +}) > + > +(define_insn "*sqrt<mode>2" > [(set (match_operand:GPF 0 "register_operand" "=w") > (sqrt:GPF (match_operand:GPF 1 "register_operand" "w")))] > "TARGET_FLOAT" > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt > index c637ff4..ffd5540 100644 > --- a/gcc/config/aarch64/aarch64.opt > +++ b/gcc/config/aarch64/aarch64.opt > @@ -151,5 +151,10 @@ PC relative literal loads. > > mlow-precision-recip-sqrt > Common Var(flag_mrecip_low_precision_sqrt) Optimization > -When calculating the reciprocal square root approximation, > -uses one less step than otherwise, thus reducing latency and precision. > +When calculating the approximate reciprocal square root, > +use one less step than otherwise, thus reducing latency and precision. > + > +mlow-precision-sqrt > +Common Var(flag_mlow_precision_sqrt) Optimization > +When calculating the approximate square root, > +use one less step than otherwise, thus reducing latency and precision. Do we really want separate options for each of these optimizations? Could they all be wrapped under one -mlow-precision-fp or similar? I can see benefits to either approach, but let me know your thoughts. Thanks, James