Re: [PATCH 2/3][AArch64] Emit square root using the Newton series

James Greenhalgh Wed, 25 May 2016 08:54:18 -0700

On Wed, Apr 27, 2016 at 04:15:45PM -0500, Evandro Menezes wrote:
>    gcc/
>         * config/aarch64/aarch64-protos.h
>         (aarch64_emit_approx_rsqrt): Replace with new function
>         "aarch64_emit_approx_sqrt".
>         (tune_params): New member "approx_sqrt_modes".
>         * config/aarch64/aarch64.c
>         (generic_tunings): New member "approx_rsqrt_modes".
>         (cortexa35_tunings): Likewise.
>         (cortexa53_tunings): Likewise.
>         (cortexa57_tunings): Likewise.
>         (cortexa72_tunings): Likewise.
>         (exynosm1_tunings): Likewise.
>         (thunderx_tunings): Likewise.
>         (xgene1_tunings): Likewise.
>         (aarch64_emit_approx_rsqrt): Replace with new function
>         "aarch64_emit_approx_sqrt".
>         (aarch64_override_options_after_change_1): Handle new option.
>         * config/aarch64/aarch64-simd.md
>         (rsqrt<mode>2): Use new function instead.
>         (sqrt<mode>2): New expansion and insn definitions.
>         * config/aarch64/aarch64.md: Likewise.
>         * config/aarch64/aarch64.opt
>         (mlow-precision-sqrt): Add new option description.
>         * doc/invoke.texi (mlow-precision-sqrt): Likewise.
> 
> 
 
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index 50f1d24..437f6af 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -244,6 +244,7 @@ struct tune_params
>    } autoprefetcher_model;
>  
>    unsigned int extra_tuning_flags;
> +  unsigned int approx_sqrt_modes;
>    unsigned int approx_rsqrt_modes;
>  };


This should go in struct recommended in 1/3 too.

>  
> @@ -396,7 +397,7 @@ void aarch64_register_pragmas (void);
>  void aarch64_relayout_simd_types (void);
>  void aarch64_reset_previous_fndecl (void);
>  void aarch64_save_restore_target_globals (tree);
> -void aarch64_emit_approx_rsqrt (rtx, rtx);
> +bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
>  
>  /* Initialize builtins for SIMD intrinsics.  */
>  void init_aarch64_simd_builtins (void);
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 68381bf..589871b 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -38,6 +38,7 @@
>  #include "recog.h"
>  #include "diagnostic.h"
>  #include "insn-attr.h"
> +#include "insn-flags.h"

Can you remember what you needed this include for? I couldn't spot it below
and removing the include didn't seem to cause any trouble to the build.
If it isn't needed, drop it.

>  #include "insn-modes.h"
>  #include "alias.h"
>  #include "fold-const.h"
> @@ -7521,46 +7530,78 @@ get_rsqrts_type (machine_mode mode)
>    }
>  }
>  
> -/* Emit instruction sequence to compute the reciprocal square root using the
> -   Newton-Raphson series.  Iterate over the series twice for SF
> -   and thrice for DF.  */
> +/* Emit instruction sequence to compute either the approximate square root
> +   or its approximate reciprocal.  */

As you are updating this function a new parameter and a return value, please
comment it appropriately. Describing the purpose of the parameter and the
meaning of the return value.

> -void
> -aarch64_emit_approx_rsqrt (rtx dst, rtx src)
> +bool
> +aarch64_emit_approx_sqrt (rtx dst, rtx src, bool recp)
>  {
> -  machine_mode mode = GET_MODE (src);
> -  gcc_assert (
> -    mode == SFmode || mode == V2SFmode || mode == V4SFmode
> -     || mode == DFmode || mode == V2DFmode);
> -
> -  rtx xsrc = gen_reg_rtx (mode);
> -  emit_move_insn (xsrc, src);
> -  rtx x0 = gen_reg_rtx (mode);
> +  machine_mode mode = GET_MODE (dst);
> +  machine_mode mmsk = mode_for_vector (int_mode_for_mode (GET_MODE_INNER 
> (mode)),
> +                                    GET_MODE_NUNITS (mode));
> +
> +  if (!flag_finite_math_only
> +      || flag_trapping_math
> +      || !flag_unsafe_math_optimizations
> +      || optimize_function_for_size_p (cfun)

> +      || !((recp && (flag_mrecip_low_precision_sqrt
> +                  || (aarch64_tune_params.approx_rsqrt_modes
> +                      & AARCH64_APPROX_MODE (mode))))
> +        || (!recp && (flag_mlow_precision_sqrt
> +                      || (aarch64_tune_params.approx_sqrt_modes
> +                          & AARCH64_APPROX_MODE (mode))))))
> +    return false;

Can you pull out these sub-clauses and assign them somewhere for clarity.
These expressions are a bit too big to grok at a distance. I'd rather be
reading:

bool use_low_precision_rsqrt_p
  = recp
     && (flag_mrecip_low_precision_sqrt
         || (aarch64_tune_params.approx_rsqrt_modes
             & AARCH64_APPROX_MODE (mode)))
   <...>

  !((use_low_precision_sqrt_p)
     || (use_low_precision_rsqrt_p))

> -  emit_insn ((*get_rsqrte_type (mode)) (x0, xsrc));
> +  rtx xmsk = gen_reg_rtx (mmsk);
> +  if (!recp)
> +    /* When calculating the approximate square root, compare the argument 
> with
> +       0.0 and create a mask.  */
> +    emit_insn (gen_rtx_SET (xmsk, gen_rtx_NEG (mmsk, gen_rtx_EQ (mmsk, src,
> +                                                       CONST0_RTX (mode)))));
>  
> -  bool double_mode = (mode == DFmode || mode == V2DFmode);
> +  /* Estimate the approximate reciprocal square root.  */
> +  rtx xdst = gen_reg_rtx (mode);
> +  emit_insn ((*get_rsqrte_type (mode)) (xdst, src));
>  
> -  int iterations = double_mode ? 3 : 2;
> +  /* Iterate over the series twice for SF and thrice for DF.  */
> +  int iterations = (GET_MODE_INNER (mode) == DFmode) ? 3 : 2;
>  
> -  /* Optionally iterate over the series one less time than otherwise.  */
> -  if (flag_mrecip_low_precision_sqrt)
> +  /* Optionally iterate over the series once less for faster performance
> +     while sacrificing the accuracy.  */
> +  if ((recp && flag_mrecip_low_precision_sqrt)
> +      || (!recp && flag_mlow_precision_sqrt))
>      iterations--;
>  
> -  for (int i = 0; i < iterations; ++i)
> +  /* Iterate over the series to calculate the approximate reciprocal square 
> root.  */
> +  rtx x1 = gen_reg_rtx (mode);

Why do you generate a register for x1 outside the loop but x2 inside it?
Is it just because you want x1 to survive this scope? It seems you could
either reuse a single x2 register too, or repeatedly generate a new x1
inside the loop too.

> +  while (iterations--)
>      {
> -      rtx x1 = gen_reg_rtx (mode);
>        rtx x2 = gen_reg_rtx (mode);
> -      rtx x3 = gen_reg_rtx (mode);
> -      emit_set_insn (x2, gen_rtx_MULT (mode, x0, x0));
> +      emit_set_insn (x2, gen_rtx_MULT (mode, xdst, xdst));
> +
> +      emit_insn ((*get_rsqrts_type (mode)) (x1, src, x2));
>  
> -      emit_insn ((*get_rsqrts_type (mode)) (x3, xsrc, x2));
> +      if (iterations > 0)
> +     emit_set_insn (xdst, gen_rtx_MULT (mode, xdst, x1));
> +    }
> +
> +  if (!recp)
> +    {
> +      /* Qualify the approximate reciprocal square root when the argument is
> +      0.0 by squashing the intermediary result to 0.0.  */
> +      rtx xtmp = gen_reg_rtx (mmsk);
> +      emit_set_insn (xtmp, gen_rtx_AND (mmsk, gen_rtx_NOT (mmsk, xmsk),
> +                                           gen_rtx_SUBREG (mmsk, xdst, 0)));
> +      emit_move_insn (xdst, gen_rtx_SUBREG (mode, xtmp, 0));
>  
> -      emit_set_insn (x1, gen_rtx_MULT (mode, x0, x3));
> -      x0 = x1;
> +      /* Calculate the approximate square root.  */
> +      emit_set_insn (xdst, gen_rtx_MULT (mode, xdst, src));
>      }
>  
> -  emit_move_insn (dst, x0);
> +  /* Return the approximation.  */
> +  emit_set_insn (dst, gen_rtx_MULT (mode, xdst, x1));
> +
> +  return true;
>  }
>  
>  /* Return the number of instructions that can be issued per cycle.  */
> @@ -8090,6 +8131,12 @@ aarch64_override_options_after_change_1 (struct 
> gcc_options *opts)
>        && (aarch64_cmodel == AARCH64_CMODEL_TINY
>         || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC))
>      aarch64_nopcrelative_literal_loads = false;
> +
> +  /* When enabling the lower precision Newton series for the square root, 
> also
> +     enable it for the reciprocal square root, since the later is an
> +     intermediary step for the latter.  */
> +  if (flag_mlow_precision_sqrt)
> +    flag_mrecip_low_precision_sqrt = true;
>  }
>  
>  /* 'Unpack' up the internal tuning structs and update the options
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 9b282f1..aab3e00 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -4683,7 +4683,16 @@
>    [(set_attr "type" "ffarith<s>")]
>  )
>  
> -(define_insn "sqrt<mode>2"
> +(define_expand "sqrt<mode>2"
> +  [(set (match_operand:GPF 0 "register_operand")
> +        (sqrt:GPF (match_operand:GPF 1 "register_operand")))]
> +  "TARGET_SIMD"
> +{
> +  if (aarch64_emit_approx_sqrt (operands[0], operands[1], false))
> +    DONE;
> +})
> +
> +(define_insn "*sqrt<mode>2"
>    [(set (match_operand:GPF 0 "register_operand" "=w")
>          (sqrt:GPF (match_operand:GPF 1 "register_operand" "w")))]
>    "TARGET_FLOAT"
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index c637ff4..ffd5540 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -151,5 +151,10 @@ PC relative literal loads.
>  
>  mlow-precision-recip-sqrt
>  Common Var(flag_mrecip_low_precision_sqrt) Optimization
> -When calculating the reciprocal square root approximation,
> -uses one less step than otherwise, thus reducing latency and precision.
> +When calculating the approximate reciprocal square root,
> +use one less step than otherwise, thus reducing latency and precision.
> +
> +mlow-precision-sqrt
> +Common Var(flag_mlow_precision_sqrt) Optimization
> +When calculating the approximate square root,
> +use one less step than otherwise, thus reducing latency and precision.

Do we really want separate options for each of these optimizations? Could
they all be wrapped under one -mlow-precision-fp or similar? I can see
benefits to either approach, but let me know your thoughts.

Thanks,
James

Re: [PATCH 2/3][AArch64] Emit square root using the Newton series

Reply via email to