[PATCH 00/11] Add FP overloads for __atomic_fetch_add etc

mmalcomson Thu, 14 Nov 2024 05:57:52 -0800

From: Matthew Malcomson <mmalcom...@nvidia.com>

Hello,


This is the revision of the RFC posted here:
https://gcc.gnu.org/pipermail/gcc-patches/2024-September/663355.html

This patchset introduces floating point versions of atomic fetch_add,
fetch_sub, add_fetch and sub_fetch.  Instructions for performing these
operations have been directly available in GPU hardware for a while, and
are now starting to get added to CPU ISA's with instructions like the
AArch64 LDFADD.  Clang has allowed floating point types to be used with
these builtins for a while now https://reviews.llvm.org/D71726.

Introducing these new overloads to this builtin allows users to directly
specify the operation needed and hence allows the compiler to provide
optimised output if possible.

There is additional motivation to use such floating point type atomic
operations in libstdc++ so that other compilers can use libstdc++ to
generate optimal code for their own targets (e.g. NVC++ can use
libstdc++ atomic<float>::fetch_add to generate optimal code for GPU's
when using the `-stdpar` argument).  A patch implementing that has been
sent to the libstdc++ mailing list here:
https://gcc.gnu.org/pipermail/libstdc++/2024-October/059865.html

Uses of such builtins in libstdc++ doesn't technically need introduction
of the builtins to GCC -- just ensuring that SFINAE techniques can be used
with overloaded builtins (as is done in this patch:
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665982.html).
That said it is still worthwhile to add these builtins -- especially now
that a primary target will be adding these operations.
N.b. I include an update to that libstdc++ patch in this patch series
because I noticed the ChangeLog in the cover letter had spaces instead
of tabs.

In the original RFC I suggested that libstdc++ use preprocessor checks
along the lines of `__has_builtin(__atomic_fetch_add_fp)` to determine
whether to use the new builtins or a CAS loop written inline.  After asking
the clang community I found out that such resolved versions of the atomic
builtins are not exposed by clang to the user.  They suggested the use of
SFINAE to determine if a given type works with the `__atomic_fetch_add`
builtin.  That's why I have posted the SFINAE patch above and chosen this
approach.

------------------------------
As standard with the existing atomic builtins, we add the same functions
in libatomic, allowing a fallback for when a given target has not
implemented these operations directly in hardware.  In order to use
these functions we need to have a naming scheme that encodes the type --
we use a suffix of _fp to denote that this operation is on a floating
point type, and a further empty suffix to denote a double, 'f' to denote
a float, and similar.  The scheme for the second part of the suffix
taken from the existing builtins that have different versions for
different floating point types -- e.g.  __builtin_acosh,
__builtin_acoshf, __builtin_acoshl, etc.

In order to add floating point functions to libatomic we updated the
makefile machinery to use names more descriptive of the new setup (where
the SIZE of the datatype can no longer be used to distinguish all
operations from each other).  Moreover we add a CAS loop implementation
in fop_n.c that handles floating point exception information and handles
casting between floating point and integral types when switching between
applying the operation and using CAS to attempt to store.

------------------------------
As Joseph Myers pointed out in response to my RFC, when performing
floating point operations in a CAS loop there is the floating point
exception information to take care of.  In order to take care of this
information I use the existing `atomic_assign_expand_fenv` target hook
to generate code that checks this information.

Partly due to the fact that this hook emits GENERIC code and partly due
to the language-specific semantics of floating point exceptions, this
means we now decide whether to emit a CAS loop handling the frontend
(during overload resolution).  The frontend decides to only use the
underlying builtin if the backend has an optab defined that can
implement it directly.

------------------------------
Now that the expansion to a CAS loop is performed in overloaded builtin
resolution, this means that if the user were to directly use a resolved
version (e.g. `__atomic_fetch_add_fp` for a double) that would not
expand into a CAS loop inline.  Instead (assuming the optab is not
implemented for this target) it would pass through and end up using the
libatomic fallback.

This is not ideal, but I believe the complexity of adding another clause
for this expansion to a CAS loop is not worth the benefit of handling a
CAS loop expansion for this specific case (partly on the assumption that
users would rarely specify the resolved version and partly on the belief
that these resolved versions are not actually part of the user-facing
interface -- since they're not documented in the manual and don't seem
to be used enough for clang to expose the interface).

I considered not exposing the resolved versions to the user (similar to
the interface that _BitInt exposes) and instead handling them as an
internal function that could expand to call the libatomic
implementation.  I chose not to do that for consistency with the rest of
the atomic builtins.

------------------------------
There are a few places throughout the compiler that handle such atomic
builtins and I have not updated to handle floating point atomic
builtins.  Places like asan, tsan, gimple-ssa-warn-access, analyzer, and
tree-ssa-forwprop would need to be updated eventually.
However since the current state of GCC is that no backend implements
these optabs directly the generic version of the builtin is always
expanded as a CAS loop in the frontend -- this means these mid-end
passes will not see any of these builtins except in the case that the
user explicitly calls the resolved version.
I hoping to update these places in a later patch.

------------------------------
N.b. there is a remaining bug where ix86_atomic_assign_expand_fenv
generates code that gets broken during optimisation by `fold`.  I
believe the code returned by this function is incorrect (maybe only for
C++?).  Hoping someone can confirm that.
The expression that gets incorrectly optimised is along the lines of:
COMPOUND_EXPR<TARGET_EXPR<var1, some-init>,
              TARGET_EXPR<var2, expression-with-var1>>
and `fold` (which gets called by `cp_fold`) removes the first
TARGET_EXPR since it doesn't look like it has side effects (even though
the variable it sets is used in the second expression).  I trialled
setting TREE_SIDE_EFFECTS in the hook and that appeared to work but
would appreciate feedback on whether that's the "correct" fix or not
(and if so confirmation that all relevant hooks should do something
similar).
Due to this currently C++ code using this builtin fails on x86_64.
libstdc++-v3/testsuite/29_atomics/atomic_float/compare_exchange_padding.cc

------------------------------
N.b. the story around C++ is not quite finished yet.  Automatic linking
in of libatomic is being worked on and we hope to send something
upstream this week.  Similarly the target hooks that were written for C
seem to be problematic for C++ (as described in the point above).
The goal is still to include this feature (after having fixed the
problem outlined above) and having automatic linking of libatomic.
However, I have a few fallback approaches that I'd like to gather
opinions on (with the goal of those fallbacks to upstream the libstdc++
changes that I linked earlier in this message):
  1) Disable fenv exception handling in the builtin for C++ code for this
     release.  (libstdc++ does not currently hande fenv exception, so
     this would not be a regression in that code.  I don't think adding
     exception handling in later would be backwards incompatible).
  2) Include the builtin-SFINAE patch and the libstdc++ patch (both
     already upstreamed) into this release knowing that GCC itself will
     not use this codepath.

------------------------------
Testing done:
  Bootstrap and regression test passes (with known exceptions of C++ on
  x86_64, and libatomic linking in general) on x86_64 and AArch64.
  Cross compiler regression tests pass on arm-linux.
  Also tested with the compiler installed to check behaviour when
  libatomic is in the link path (to check that known failures from
  libatomic linking go away when it's available).
  Similarly tested with a dummy implementation of fetch_add as an optab
  in the AArch64 backend to ensure that codepath also works.

------------------------------

Matthew Malcomson (11):
  libatomic: Split concept of SUFFIX and SIZE in libatomic
  libatomic: Add floating point implementations of fetch_{add,sub}
  c: c++: Define new floating point builtin fetch_add functions
  builtins: Add FP types for atomic builtin overload resolution
  c: c++: Expand into CAS loop in frontend
  builtins: optab: Tie the new atomic builtins to the backend
  testsuite: Add tests for fp resolutions of __atomic_fetch_add
  c: c++: flag to disable fetch_op handling fenv exceptions
  doc: Mention floating point atomic fetch_add etc in docs
  [Not For Commit] Add demo implementation of one of the operations
  [Not For Commit] Link libatomic into tests when available

 gcc/builtin-types.def                         |   20 +
 gcc/builtins.cc                               |  176 ++
 gcc/builtins.h                                |    2 +
 gcc/c-family/c-common.cc                      |  205 ++-
 gcc/c-family/c.opt                            |    6 +
 gcc/config/aarch64/aarch64.h                  |    2 +
 gcc/config/aarch64/aarch64.opt                |    5 +
 gcc/config/aarch64/atomics.md                 |   15 +
 gcc/doc/extend.texi                           |   10 +
 gcc/fortran/f95-lang.cc                       |    5 +
 gcc/fortran/types.def                         |   17 +
 gcc/optabs.cc                                 |   19 +
 gcc/optabs.def                                |    6 +-
 gcc/sync-builtins.def                         |   40 +
 .../template/builtin-atomic-overloads.def     |   28 +-
 .../template/builtin-atomic-overloads6.C      |   23 +-
 .../template/builtin-atomic-overloads7.C      |   16 +-
 gcc/testsuite/gcc.dg/atomic-op-fp-convert.c   |    6 +
 gcc/testsuite/gcc.dg/atomic-op-fp-errs.c      |   14 +
 gcc/testsuite/gcc.dg/atomic-op-fp-fenv-flag.c |   44 +
 .../gcc.dg/atomic-op-fp-resolve-complain.c    |    5 +
 gcc/testsuite/gcc.dg/atomic-op-fp.c           |  200 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf.c          |  200 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf128.c       |  203 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf16.c        |  203 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf16b.c       |  203 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf32.c        |  203 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf32x.c       |  203 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf64.c        |  203 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf64x.c       |  203 +++
 gcc/testsuite/gcc.dg/atomic-op-fpl.c          |  200 +++
 .../gcc.dg/atomic/atomic-op-fp-fenv.c         |  376 +++++
 .../gcc.target/i386/excess-precision-13.c     |   88 +
 gcc/testsuite/lib/target-supports.exp         |  199 ++-
 libatomic/Makefile.am                         |   38 +-
 libatomic/Makefile.in                         |   44 +-
 libatomic/acinclude.m4                        |   56 +-
 libatomic/auto-config.h.in                    |  114 +-
 libatomic/cas_n.c                             |    8 +-
 libatomic/config/linux/aarch64/host-config.h  |   23 +-
 libatomic/config/linux/arm/host-config.h      |    2 +-
 libatomic/config/s390/cas_n.c                 |    6 +-
 libatomic/config/s390/exch_n.c                |    4 +-
 libatomic/config/s390/load_n.c                |    4 +-
 libatomic/config/s390/store_n.c               |    4 +-
 libatomic/config/x86/host-config.h            |   14 +-
 libatomic/configure                           | 1485 ++++++++++++++++-
 libatomic/configure.ac                        |    6 +
 libatomic/exch_n.c                            |   12 +-
 libatomic/fadd_n.c                            |   23 +-
 libatomic/fop_n.c                             |  111 +-
 libatomic/fsub_n.c                            |   23 +
 libatomic/libatomic.map                       |   44 +
 libatomic/libatomic_i.h                       |  186 ++-
 libatomic/load_n.c                            |   12 +-
 libatomic/store_n.c                           |   12 +-
 libatomic/tas_n.c                             |   12 +-
 libatomic/testsuite/Makefile.in               |    1 +
 .../testsuite/libatomic.c/atomic-op-fp-fenv.c |  421 +++++
 .../testsuite/libatomic.c/atomic-op-fp.c      |  219 +++
 .../testsuite/libatomic.c/atomic-op-fpf.c     |  219 +++
 .../testsuite/libatomic.c/atomic-op-fpf128.c  |  220 +++
 .../testsuite/libatomic.c/atomic-op-fpf16.c   |  223 +++
 .../testsuite/libatomic.c/atomic-op-fpf16b.c  |  220 +++
 .../testsuite/libatomic.c/atomic-op-fpf32.c   |  220 +++
 .../testsuite/libatomic.c/atomic-op-fpf32x.c  |  220 +++
 .../testsuite/libatomic.c/atomic-op-fpf64.c   |  220 +++
 .../testsuite/libatomic.c/atomic-op-fpf64x.c  |  220 +++
 .../testsuite/libatomic.c/atomic-op-fpl.c     |  219 +++
 .../testsuite/29_atomics/atomic_float/1.cc    |    1 +
 .../testsuite/29_atomics/atomic_ref/float.cc  |    1 +
 71 files changed, 7978 insertions(+), 237 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp-convert.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp-errs.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp-fenv-flag.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp-resolve-complain.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf128.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf16.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf16b.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf32.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf32x.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf64.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf64x.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpl.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic/atomic-op-fp-fenv.c
 create mode 100644 gcc/testsuite/gcc.target/i386/excess-precision-13.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fp-fenv.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fp.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf128.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf16.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf16b.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf32.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf32x.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf64.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf64x.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpl.c

-- 
2.43.0

[PATCH 00/11] Add FP overloads for __atomic_fetch_add etc

Reply via email to