From: Matthew Malcomson <mmalcom...@nvidia.com>

Not sure who to Cc for this.  Honestly just guessing a bit here.  Please
do redirect me if anyone knows of a better set of people to ask.

-------------- >8 ------- 8< -----------
Update the optabs definitions to include floating point versions of
atomic fetch_add variants.

Things to highlight to any reviewer:
1) When unable to inline, I do not translate the modify-after
   (<op>_fetch) builtins into the modify-before (fetch_<op>) library
   call; they fall back to the matching <op>_fetch library call
   directly (the first sketch after this list shows the translation
   that the integral versions perform).
   - From looking at the dates on different parts of the code, it
     seems that this translation used to be needed before libatomic
     was added as a target library.  Since libatomic currently
     implements both the fetch_<op> and <op>_fetch versions, I don't
     believe it's needed any more.

2) We do not expand into a CAS loop in the backend (unlike fetch_op
   operations on other types), and we add an assertion that the clause
   for using a CAS loop is never reached at expand time with a
   floating point mode.

   The reason for this is that expanding a CAS loop for floating point
   operations requires deciding what to do with floating point
   exception information, and exactly what is necessary may depend on
   the frontend.  As it stands the only frontends that handle floating
   point fetch_op operations are the C/C++ frontends.  These emit the
   direct atomic function only if there is a floating point
   implementation of one of the fetch_add family of optabs.  If one of
   this family is available then the expand pass should be able to
   avoid a CAS loop; if none is available then the frontend will have
   emitted its own CAS loop accounting for floating point exception
   information (of the general shape sketched after this list).

   If floating point fetch_op operations are added to any other
   frontend, that frontend will need to handle the floating point
   exception behaviour for its language.  If there is essentially
   nothing to do then it may be worth adjusting the expand pass to
   allow expanding a CAS loop in a floating point mode, adding a hook
   into the frontends that queries "should the expand pass replace a
   floating point fetch_op with a CAS loop?".

   N.b. the current scheme does mean that if a backend provides an
   optab for a floating point fetch_add then that optab must accept
   any floating point value as the operand.  This is because the
   frontend only checks whether the optab exists, and decides on that
   basis whether or not to emit its own CAS loop.
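
To illustrate the translation mentioned in point 1: when an integral
__atomic_add_fetch cannot be inlined, the middle end emits the
fetch_add library call and re-applies the operation.  A minimal C
sketch of that equivalence (the helper name is made up for
illustration; this is not the code GCC emits):

    /* __atomic_add_fetch (p, v, memorder) can be recovered from the
       modify-before call by re-applying the addition.  */
    int
    add_fetch_via_fetch_add (int *p, int v, int memorder)
    {
      return __atomic_fetch_add (p, v, memorder) + v;
    }

For the floating point builtins this patch skips that translation and
simply calls the corresponding __atomic_add_fetch entry point in
libatomic.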

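To make the floating point exception concern in point 2 concrete,
below is one possible shape of a frontend-emitted CAS loop for a
double fetch_add.  This is an illustrative C11 sketch only (the
function name is made up and FENV_ACCESS concerns are ignored for
brevity), not GCC's actual lowering:

    #include <fenv.h>
    #include <stdatomic.h>

    static double
    fetch_add_double_cas (_Atomic double *p, double v)
    {
      fenv_t env;
      double old = atomic_load_explicit (p, memory_order_relaxed);
      for (;;)
        {
          /* Save the environment and clear the exception flags so a
             failed attempt leaves no trace.  */
          feholdexcept (&env);
          double desired = old + v;   /* May raise FP exceptions.  */
          if (atomic_compare_exchange_weak_explicit
                (p, &old, desired, memory_order_seq_cst,
                 memory_order_relaxed))
            {
              /* Merge in the exceptions raised by the attempt that
                 won.  */
              feupdateenv (&env);
              return old;
            }
          /* Discard the exceptions raised by the failed attempt.  */
          fesetenv (&env);
        }
    }

Without something along these lines, a generic CAS expansion could
leave exception flags set by additions whose results were discarded.
A frontend for which that is acceptable could answer "yes" through the
hook suggested above and let the generic expansion proceed.
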
gcc/ChangeLog:

        * builtins.cc (get_builtin_fp_sync_mode): New.
        (expand_builtin): Handle new floating point fetch_add/fetch_sub
        and add_fetch/sub_fetch.
        * optabs.cc (expand_atomic_fetch_op): Add assertion that we
        never use compare and swap on a floating point type.
        * optabs.def: Add floating point type optabs for existing
        fetch_add/fetch_sub/add_fetch/sub_fetch operations.

Signed-off-by: Matthew Malcomson <mmalcom...@nvidia.com>
---
 gcc/builtins.cc | 95 +++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/optabs.cc   | 19 ++++++++++
 gcc/optabs.def  |  6 +++-
 3 files changed, 119 insertions(+), 1 deletion(-)

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 3585c5c9af9..7807f6d7c90 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -6436,6 +6436,19 @@ get_builtin_fp_offset (struct type_to_repr_builtin x)
   return x.example - BUILT_IN_ATOMIC_FETCH_ADD_N;
 }
 
+static inline machine_mode
+get_builtin_fp_sync_mode (int fcode_diff)
+{
+  const struct type_to_repr_builtin *tto_p;
+  size_t len = fp_type_mappings (&tto_p);
+  for (size_t i = 0; i < len; i++)
+    {
+      if ((size_t) fcode_diff == get_builtin_fp_offset (tto_p[i]))
+       return TYPE_MODE (tto_p[i].type);
+    }
+  gcc_unreachable ();
+}
+
 int
 get_builtin_fp_offset (tree type)
 {
@@ -8921,6 +8934,88 @@ expand_builtin (tree exp, rtx target, rtx subtarget, machine_mode mode,
        return target;
       break;
 
+    case BUILT_IN_ATOMIC_FETCH_ADD_FPF:
+    case BUILT_IN_ATOMIC_FETCH_ADD_FP:
+    case BUILT_IN_ATOMIC_FETCH_ADD_FPL:
+    case BUILT_IN_ATOMIC_FETCH_ADD_FPF16B:
+    case BUILT_IN_ATOMIC_FETCH_ADD_FPF16:
+    case BUILT_IN_ATOMIC_FETCH_ADD_FPF32:
+    case BUILT_IN_ATOMIC_FETCH_ADD_FPF64:
+    case BUILT_IN_ATOMIC_FETCH_ADD_FPF128:
+    case BUILT_IN_ATOMIC_FETCH_ADD_FPF32X:
+    case BUILT_IN_ATOMIC_FETCH_ADD_FPF64X:
+      mode = get_builtin_fp_sync_mode (fcode - BUILT_IN_ATOMIC_FETCH_ADD_N);
+      target = expand_builtin_atomic_fetch_op (mode, exp, target, PLUS, false,
+                                              ignore, BUILT_IN_NONE);
+      if (target)
+       return target;
+      break;
+
+    case BUILT_IN_ATOMIC_ADD_FETCH_FPF:
+    case BUILT_IN_ATOMIC_ADD_FETCH_FP:
+    case BUILT_IN_ATOMIC_ADD_FETCH_FPL:
+    case BUILT_IN_ATOMIC_ADD_FETCH_FPF16B:
+    case BUILT_IN_ATOMIC_ADD_FETCH_FPF16:
+    case BUILT_IN_ATOMIC_ADD_FETCH_FPF32:
+    case BUILT_IN_ATOMIC_ADD_FETCH_FPF64:
+    case BUILT_IN_ATOMIC_ADD_FETCH_FPF128:
+    case BUILT_IN_ATOMIC_ADD_FETCH_FPF32X:
+    case BUILT_IN_ATOMIC_ADD_FETCH_FPF64X:
+      mode = get_builtin_fp_sync_mode (fcode - BUILT_IN_ATOMIC_ADD_FETCH_N);
+      /* TODO: I do not translate to the FETCH_ADD library call when this
+        fails to inline, although the integral ADD_FETCH versions of the
+        atomic functions do.  I don't understand why they make that
+        transformation; one could *guess* that fetch_add is the more
+        likely function to be implemented, except that libatomic seems to
+        implement everything if it implements anything.
+        -- An explanation of why the integral versions make this
+        translation (which would help decide whether these floating point
+        versions should make it too) would be welcome.
+
+        A comment in gcc.dg/atomic-noinline.c seems to imply that such a
+        translation was necessary at one point.  That comment predates the
+        introduction of libatomic as a GCC target library, so I guess it
+        reflects an earlier state of the ecosystem.  */
+      target = expand_builtin_atomic_fetch_op (mode, exp, target, PLUS, true,
+                                              ignore, BUILT_IN_NONE);
+      if (target)
+       return target;
+      break;
+
+    case BUILT_IN_ATOMIC_FETCH_SUB_FPF:
+    case BUILT_IN_ATOMIC_FETCH_SUB_FP:
+    case BUILT_IN_ATOMIC_FETCH_SUB_FPL:
+    case BUILT_IN_ATOMIC_FETCH_SUB_FPF16B:
+    case BUILT_IN_ATOMIC_FETCH_SUB_FPF16:
+    case BUILT_IN_ATOMIC_FETCH_SUB_FPF32:
+    case BUILT_IN_ATOMIC_FETCH_SUB_FPF64:
+    case BUILT_IN_ATOMIC_FETCH_SUB_FPF128:
+    case BUILT_IN_ATOMIC_FETCH_SUB_FPF32X:
+    case BUILT_IN_ATOMIC_FETCH_SUB_FPF64X:
+      mode = get_builtin_fp_sync_mode (fcode - BUILT_IN_ATOMIC_FETCH_SUB_N);
+      target = expand_builtin_atomic_fetch_op (mode, exp, target, MINUS, false,
+                                              ignore, BUILT_IN_NONE);
+      if (target)
+       return target;
+      break;
+
+    case BUILT_IN_ATOMIC_SUB_FETCH_FPF:
+    case BUILT_IN_ATOMIC_SUB_FETCH_FP:
+    case BUILT_IN_ATOMIC_SUB_FETCH_FPL:
+    case BUILT_IN_ATOMIC_SUB_FETCH_FPF16B:
+    case BUILT_IN_ATOMIC_SUB_FETCH_FPF16:
+    case BUILT_IN_ATOMIC_SUB_FETCH_FPF32:
+    case BUILT_IN_ATOMIC_SUB_FETCH_FPF64:
+    case BUILT_IN_ATOMIC_SUB_FETCH_FPF128:
+    case BUILT_IN_ATOMIC_SUB_FETCH_FPF32X:
+    case BUILT_IN_ATOMIC_SUB_FETCH_FPF64X:
+      mode = get_builtin_fp_sync_mode (fcode - BUILT_IN_ATOMIC_SUB_FETCH_N);
+      target = expand_builtin_atomic_fetch_op (mode, exp, target, MINUS, true,
+                                              ignore, BUILT_IN_NONE);
+      if (target)
+       return target;
+      break;
+
     case BUILT_IN_ATOMIC_TEST_AND_SET:
       target = expand_builtin_atomic_test_and_set (exp, target);
       if (target)
diff --git a/gcc/optabs.cc b/gcc/optabs.cc
index 03ef0c5d81d..4747b5e5732 100644
--- a/gcc/optabs.cc
+++ b/gcc/optabs.cc
@@ -7877,6 +7877,25 @@ expand_atomic_fetch_op (rtx target, rtx mem, rtx val, enum rtx_code code,
   /* If nothing else has succeeded, default to a compare and swap loop.  */
   if (can_compare_and_swap_p (mode, true))
     {
+      /* N.b. we do not currently implement expanding a floating point
+        fetch_<op> operation into a compare and swap loop.  Doing so would
+        not be problematic in itself, except that in at least the C/C++
+        frontends there must be extra handling of floating point exceptions
+        to ensure the exception information left behind refers to the
+        operation that succeeded in the CAS rather than an earlier attempt.
+
+        No other frontend handles this case yet.  Until that work is done,
+        the choices are either to allow the CAS expansion while ignoring
+        the issue (silently generating code that may or may not be correct
+        depending on language semantics) or to not implement it at all
+        (never generating the compare and swap).
+
+        The second seems the safer option, hence we simply assert that this
+        clause is never entered with a floating point mode.
+        N.b. this requires that the backend's atomic floating point
+        fetch_add (or variant) optab accept any value for the addend,
+        i.e. a register or memory operand, not just certain constants.  */
+      gcc_assert (!SCALAR_FLOAT_MODE_P (mode));
       rtx_insn *insn;
       rtx t0 = gen_reg_rtx (mode), t1;
 
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 75f39d85ada..ababb653bfb 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -507,6 +507,7 @@ OPTAB_D (sync_sub_optab, "sync_sub$I$a")
 OPTAB_D (sync_xor_optab, "sync_xor$I$a")
 
 OPTAB_D (atomic_add_fetch_optab, "atomic_add_fetch$I$a")
+OPTAB_NX (atomic_add_fetch_optab, "atomic_add_fetch$F$a")
 OPTAB_D (atomic_add_optab, "atomic_add$I$a")
 OPTAB_D (atomic_and_fetch_optab, "atomic_and_fetch$I$a")
 OPTAB_D (atomic_and_optab, "atomic_and$I$a")
@@ -515,11 +516,13 @@ OPTAB_D (atomic_bit_test_and_complement_optab, "atomic_bit_test_and_complement$I
 OPTAB_D (atomic_bit_test_and_reset_optab, "atomic_bit_test_and_reset$I$a")
 OPTAB_D (atomic_compare_and_swap_optab, "atomic_compare_and_swap$I$a")
 OPTAB_D (atomic_exchange_optab,         "atomic_exchange$I$a")
-OPTAB_D (atomic_fetch_add_optab, "atomic_fetch_add$I$a")
+OPTAB_D (atomic_fetch_add_optab, "atomic_fetch_add$F$a")
+OPTAB_NX (atomic_fetch_add_optab, "atomic_fetch_add$I$a")
 OPTAB_D (atomic_fetch_and_optab, "atomic_fetch_and$I$a")
 OPTAB_D (atomic_fetch_nand_optab, "atomic_fetch_nand$I$a")
 OPTAB_D (atomic_fetch_or_optab, "atomic_fetch_or$I$a")
 OPTAB_D (atomic_fetch_sub_optab, "atomic_fetch_sub$I$a")
+OPTAB_NX (atomic_fetch_sub_optab, "atomic_fetch_sub$F$a")
 OPTAB_D (atomic_fetch_xor_optab, "atomic_fetch_xor$I$a")
 OPTAB_D (atomic_load_optab, "atomic_load$I$a")
 OPTAB_D (atomic_nand_fetch_optab, "atomic_nand_fetch$I$a")
@@ -528,6 +531,7 @@ OPTAB_D (atomic_or_fetch_optab, "atomic_or_fetch$I$a")
 OPTAB_D (atomic_or_optab, "atomic_or$I$a")
 OPTAB_D (atomic_store_optab, "atomic_store$I$a")
 OPTAB_D (atomic_sub_fetch_optab, "atomic_sub_fetch$I$a")
+OPTAB_NX (atomic_sub_fetch_optab, "atomic_sub_fetch$F$a")
 OPTAB_D (atomic_sub_optab, "atomic_sub$I$a")
 OPTAB_D (atomic_xor_fetch_optab, "atomic_xor_fetch$I$a")
 OPTAB_D (atomic_xor_optab, "atomic_xor$I$a")
-- 
2.43.0
