On 09/12/2016 10:19 AM, Tamar Christina wrote:
Hi All,

This patch adds an optimized route to the fpclassify builtin
for floating-point numbers whose format is similar to IEEE 754.

The goal is to make it faster by:
1. Trying to determine the most common case first
   (e.g. the float is a Normal number) and then the
   rest. The amount of code generated at -O2 is
   about the same ± 1 instruction, but the code
   is much better.
2. Using integer operations in the optimized path.

At a high level, the optimized path uses integer operations
to perform the following:

  if (exponent bits aren't all set or unset)
     return Normal;
  else if (no bits are set on the number after masking out
           the sign bit)
     return Zero;
  else if (exponent has no bits set)
     return Subnormal;
  else if (mantissa has no bits set)
     return Infinite;
  else
     return NaN;
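
For reference, a minimal C sketch of the same checks for IEEE binary64
is shown below (the function name, the use of memcpy and the math.h
FP_* macros are illustrative only; the patch builds the equivalent
trees and takes the classification values from the builtin's arguments):

  #include <math.h>     /* FP_NAN, FP_INFINITE, etc.  */
  #include <stdint.h>
  #include <string.h>

  static int
  fpclassify_binary64 (double d)
  {
    uint64_t bits;
    memcpy (&bits, &d, sizeof bits);          /* reinterpret as an integer  */

    const uint64_t sign_mask = 1ULL << 63;
    const uint64_t exp_mask  = 0x7ff;         /* 11 = 64 - 53 exponent bits  */
    const uint64_t mant_mask = (1ULL << 52) - 1;
    uint64_t exponent = (bits >> 52) & exp_mask;

    /* Exponent bits neither all set nor all clear: a normal number.  */
    if (((exponent + 1) & (exp_mask & ~1ULL)) != 0)
      return FP_NORMAL;
    if ((bits & ~sign_mask) == 0)
      return FP_ZERO;
    if (exponent == 0)
      return FP_SUBNORMAL;
    return (bits & mant_mask) == 0 ? FP_INFINITE : FP_NAN;
  }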

In case the optimization can't be applied, the old
implementation is used as a fallback.

A limitation of this new approach is that the exponent
of the floating-point type has to fit in 31 bits and the floating-point
type has to have an IEEE-like format and values for NaN and INF
(e.g. for NaN and INF all exponent bits must be set).

To determine this IEEE likeness a new boolean was added to real_format.

Regression tests were run on aarch64-none-linux and arm-none-linux-gnueabi
with no regressions. x86 uses its own implementation rather than
the fpclassify builtin.

As an example, AArch64 now generates the following for classification of doubles:

f:
        fmov    x1, d0
        mov     w0, 7
        sbfx    x2, x1, 52, 11
        add     w3, w2, 1
        tst     w3, 0x07FE
        bne     .L1
        mov     w0, 13
        tst     x1, 0x7fffffffffffffff
        beq     .L1
        mov     w0, 11
        tbz     x2, 0, .L1
        tst     x1, 0xfffffffffffff
        mov     w0, 3
        mov     w1, 5
        csel    w0, w0, w1, ne

.L1:
        ret

No new tests are added, as existing tests already cover the functionality.
The glibc benchmarks were run against the builtin and show a 31.3%
performance gain.

Ok for trunk?

Thanks,
Tamar

PS. I don't have commit rights so if OK can someone apply the patch for me.

gcc/
2016-08-25  Tamar Christina  <tamar.christ...@arm.com>
            Wilco Dijkstra  <wilco.dijks...@arm.com>

        * builtins.c (fold_builtin_fpclassify): Add optimized version.
        * real.h (real_format): Add is_binary_ieee_compatible field.
        * real.c (ieee_single_format): Set is_binary_ieee_compatible flag.
        (mips_single_format): Likewise.
        (motorola_single_format): Likewise.
        (spu_single_format): Likewise.
        (ieee_double_format): Likewise.
        (mips_double_format): Likewise.
        (motorola_double_format): Likewise.
        (ieee_extended_motorola_format): Likewise.
        (ieee_extended_intel_128_format): Likewise.
        (ieee_extended_intel_96_round_53_format): Likewise.
        (ibm_extended_format): Likewise.
        (mips_extended_format): Likewise.
        (ieee_quad_format): Likewise.
        (mips_quad_format): Likewise.
        (vax_f_format): Likewise.
        (vax_d_format): Likewise.
        (vax_g_format): Likewise.
        (decimal_single_format): Likewise.
        (decimal_quad_format): Likewise.
        (ieee_half_format): Likewise.
        (arm_half_format): Likewise.
        (real_internal_format): Likewise.


gcc-public.patch


diff --git a/gcc/builtins.c b/gcc/builtins.c
index 1073e35b17b1bc1f6974c71c940bd9d82bbbfc0f..58bf129f9a0228659fd3b976d38d021d1d5bd6bb 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -7947,10 +7947,8 @@ static tree
 fold_builtin_fpclassify (location_t loc, tree *args, int nargs)
 {
   tree fp_nan, fp_infinite, fp_normal, fp_subnormal, fp_zero,
-    arg, type, res, tmp;
+    arg, type, res;
   machine_mode mode;
-  REAL_VALUE_TYPE r;
-  char buf[128];

   /* Verify the required arguments in the original call.  */
   if (nargs != 6
@@ -7970,14 +7968,143 @@ fold_builtin_fpclassify (location_t loc, tree *args, int nargs)
   arg = args[5];
   type = TREE_TYPE (arg);
   mode = TYPE_MODE (type);
-  arg = builtin_save_expr (fold_build1_loc (loc, ABS_EXPR, type, arg));
+  const real_format *format = REAL_MODE_FORMAT (mode);
+
+  /*
+  For IEEE 754 types:
+
+  fpclassify (x) ->
+       !((exp + 1) & (exp_mask & ~1)) // exponent bits not all set or unset
+        ? (x & sign_mask == 0 ? FP_ZERO :
+          (exp & exp_mask == exp_mask
+             ? (mantissa == 0 ? FP_INFINITE : FP_NAN) :
+             FP_SUBNORMAL)):
+       FP_NORMAL.
+
+  Otherwise
+
+  fpclassify (x) ->
+       isnan (x) ? FP_NAN :
+       (fabs (x) == Inf ? FP_INFINITE :
+          (fabs (x) >= DBL_MIN ? FP_NORMAL :
+            (x == 0 ? FP_ZERO : FP_SUBNORMAL))).
+  */
+
+  /* Check if the number that is being classified is close enough to IEEE 754
+     format to be able to go in the early exit code.  */
+  if (format->is_binary_ieee_compatible)
+    {
+      gcc_assert (format->b == 2);
+
+      const tree int_type = integer_type_node;
+      const int exp_bits  = (GET_MODE_SIZE (mode) * BITS_PER_UNIT) - format->p;
+      const int exp_mask  = (1 << exp_bits) - 1;
+
+      tree exp, specials, exp_bitfield,
+          const_arg0, const_arg1, const0, const1,
+          not_sign_mask, zero_check, mantissa_mask,
+          mantissa_any_set, exp_lsb_set, mask_check;
+      tree int_arg_type, int_arg;
Style nit.  Just use

  tree exp, specials, exp_bitfield;
  tree const_arg0, const_arg1, etc etc.


+
+      /* Re-interpret the float as an unsigned integer type
+        with equal precision.  */
+      int_arg_type = build_nonstandard_integer_type (TYPE_PRECISION (type), 0);
+      int_arg = fold_build1_loc (loc, INDIRECT_REF, int_arg_type,
+                 fold_build1_loc (loc, NOP_EXPR,
+                                  build_pointer_type (int_arg_type),
+                   fold_build1_loc (loc, ADDR_EXPR,
+                                    build_pointer_type (type), arg)));
Doesn't this make ARG addressable?  Which in turn means ARG won't be exposed to the gimple/ssa optimizers.  Or is it the case that when fpclassify is used, its argument is already in memory (and thus addressable)?
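
In C terms the tree being built above is roughly

  uint64_t bits = *(uint64_t *) &arg;   /* &arg forces arg into memory */

(a sketch only, with uint64_t standing in for the equal-precision integer
type); the address-of is what raises the addressability concern.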


+      /* ~(1 << location_sign_bit).
+        This creates a mask that can be used to mask out the sign bit.  */
+      not_sign_mask = fold_build1_loc (loc, BIT_NOT_EXPR, int_arg_type,
+                       fold_build2_loc (loc, LSHIFT_EXPR, int_arg_type,
+                         const_arg1,
+                         build_int_cst (int_arg_type, format->signbit_rw)));
Formatting nits. When you have to wrap a call, the arguments are formatted like this

foo (arg, arg, arg, ...
     arg, arg);

Given you've got calls to fold_build2_loc, build_int_cst, etc embedded inside other calls to fold_build2_loc, I'd just create some temporaries to hold the results of the inner calls. That'll clean up the formatting significantly.
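
For instance (the temporary names below are illustrative, not from the
patch), the not_sign_mask computation above could be split up as:

  tree sign_bit_pos = build_int_cst (int_arg_type, format->signbit_rw);
  tree sign_bit = fold_build2_loc (loc, LSHIFT_EXPR, int_arg_type,
                                   const_arg1, sign_bit_pos);
  not_sign_mask = fold_build1_loc (loc, BIT_NOT_EXPR, int_arg_type, sign_bit);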


+                                            exp, const1));
+
+      /* Combine the values together.  */
+      specials = fold_build3_loc (loc, COND_EXPR, int_type, zero_check, fp_zero,
+                  fold_build3_loc (loc, COND_EXPR, int_type, exp_lsb_set,
+                   fold_build3_loc (loc, COND_EXPR, int_type, mantissa_any_set,
+                     HONOR_NANS (mode) ? fp_nan : fp_normal,
+                     HONOR_INFINITIES (mode) ? fp_infinite : fp_normal),
+                   fp_subnormal));
So this implies you're running on generic, not gimple, right? Otherwise you can't generate these kinds of expressions.


diff --git a/gcc/real.h b/gcc/real.h
index 59af580e78f2637be84f71b98b45ec6611053222..36ded57cf4db7c30c935bdb24219a167480f39c8 100644
--- a/gcc/real.h
+++ b/gcc/real.h
@@ -161,6 +161,15 @@ struct real_format
   bool has_signed_zero;
   bool qnan_msb_set;
   bool canonical_nan_lsbs_set;
+
+  /* This flag indicates whether the format can be used in the optimized
+     code paths for the __builtin_fpclassify function and friends.
+     The format has to have the same NaN and INF representation as normal
+     IEEE floats (e.g. exp must have all bits set), most significant bit must
+     be sign bit, followed by exp bits of at most 32 bits.  Lastly the floating
+     point number must be representable as an integer.  The base of the number
+     also must be base 2.  */
+  bool is_binary_ieee_compatible;
   const char *name;
 };
I think Joseph has already commented on the contents of the initializer and a few more cases where we can use the optimized paths.

However, I do have a general question. There are some targets which have FPUs that are basically IEEE, but don't support certain IEEE features like NaNs, denorms, etc.

Presumably all that's needed is for those targets to define a hook to describe which checks will always be false and you can check the hook's return value. Right?


Can you please include some tests to verify you're getting the initial code generation you want? Ideally there'd be execution tests too where you generate one of the special values, then call the __builtin and verify that you get the expected results back. The latter in particular are key since they'll allow us to catch problems much earlier across the wide variety of targets GCC supports.
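
Something along these lines, perhaps (the dg directives and the
classification values passed to the builtin are illustrative only):

  /* { dg-do run } */
  #include <float.h>

  extern void abort (void);

  /* Map FP_NAN, FP_INFINITE, FP_NORMAL, FP_SUBNORMAL, FP_ZERO to 0..4.  */
  #define CLASSIFY(X) __builtin_fpclassify (0, 1, 2, 3, 4, (X))

  int
  main (void)
  {
    /* volatile keeps the arguments from being folded at compile time,
       so the generated classification code is actually exercised.  */
    volatile double nan = __builtin_nan ("");
    volatile double inf = __builtin_inf ();
    volatile double norm = 1.0;
    volatile double sub = DBL_MIN / 2.0;
    volatile double zero = 0.0;

    if (CLASSIFY (nan) != 0 || CLASSIFY (inf) != 1 || CLASSIFY (norm) != 2
        || CLASSIFY (sub) != 3 || CLASSIFY (zero) != 4)
      abort ();
    return 0;
  }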

I think you already had plans to post an updated patch. Please include the fixes noted above in that update.

And just to be clear, I like where this is going; I just think we're going to need a couple of iterations to iron everything out.

Jeff
