From: Matthew Malcomson <mmalcom...@nvidia.com> Hello,
This is the revision of the RFC posted here: https://gcc.gnu.org/pipermail/gcc-patches/2024-September/663355.html This patchset introduces floating point versions of atomic fetch_add, fetch_sub, add_fetch and sub_fetch. Instructions for performing these operations have been directly available in GPU hardware for a while, and are now starting to get added to CPU ISA's with instructions like the AArch64 LDFADD. Clang has allowed floating point types to be used with these builtins for a while now https://reviews.llvm.org/D71726. Introducing these new overloads to this builtin allows users to directly specify the operation needed and hence allows the compiler to provide optimised output if possible. There is additional motivation to use such floating point type atomic operations in libstdc++ so that other compilers can use libstdc++ to generate optimal code for their own targets (e.g. NVC++ can use libstdc++ atomic<float>::fetch_add to generate optimal code for GPU's when using the `-stdpar` argument). A patch implementing that has been sent to the libstdc++ mailing list here: https://gcc.gnu.org/pipermail/libstdc++/2024-October/059865.html Uses of such builtins in libstdc++ doesn't technically need introduction of the builtins to GCC -- just ensuring that SFINAE techniques can be used with overloaded builtins (as is done in this patch: https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665982.html). That said it is still worthwhile to add these builtins -- especially now that a primary target will be adding these operations. N.b. I include an update to that libstdc++ patch in this patch series because I noticed the ChangeLog in the cover letter had spaces instead of tabs. In the original RFC I suggested that libstdc++ use preprocessor checks along the lines of `__has_builtin(__atomic_fetch_add_fp)` to determine whether to use the new builtins or a CAS loop written inline. After asking the clang community I found out that such resolved versions of the atomic builtins are not exposed by clang to the user. They suggested the use of SFINAE to determine if a given type works with the `__atomic_fetch_add` builtin. That's why I have posted the SFINAE patch above and chosen this approach. ------------------------------ As standard with the existing atomic builtins, we add the same functions in libatomic, allowing a fallback for when a given target has not implemented these operations directly in hardware. In order to use these functions we need to have a naming scheme that encodes the type -- we use a suffix of _fp to denote that this operation is on a floating point type, and a further empty suffix to denote a double, 'f' to denote a float, and similar. The scheme for the second part of the suffix taken from the existing builtins that have different versions for different floating point types -- e.g. __builtin_acosh, __builtin_acoshf, __builtin_acoshl, etc. In order to add floating point functions to libatomic we updated the makefile machinery to use names more descriptive of the new setup (where the SIZE of the datatype can no longer be used to distinguish all operations from each other). Moreover we add a CAS loop implementation in fop_n.c that handles floating point exception information and handles casting between floating point and integral types when switching between applying the operation and using CAS to attempt to store. ------------------------------ As Joseph Myers pointed out in response to my RFC, when performing floating point operations in a CAS loop there is the floating point exception information to take care of. In order to take care of this information I use the existing `atomic_assign_expand_fenv` target hook to generate code that checks this information. Partly due to the fact that this hook emits GENERIC code and partly due to the language-specific semantics of floating point exceptions, this means we now decide whether to emit a CAS loop handling the frontend (during overload resolution). The frontend decides to only use the underlying builtin if the backend has an optab defined that can implement it directly. ------------------------------ Now that the expansion to a CAS loop is performed in overloaded builtin resolution, this means that if the user were to directly use a resolved version (e.g. `__atomic_fetch_add_fp` for a double) that would not expand into a CAS loop inline. Instead (assuming the optab is not implemented for this target) it would pass through and end up using the libatomic fallback. This is not ideal, but I believe the complexity of adding another clause for this expansion to a CAS loop is not worth the benefit of handling a CAS loop expansion for this specific case (partly on the assumption that users would rarely specify the resolved version and partly on the belief that these resolved versions are not actually part of the user-facing interface -- since they're not documented in the manual and don't seem to be used enough for clang to expose the interface). I considered not exposing the resolved versions to the user (similar to the interface that _BitInt exposes) and instead handling them as an internal function that could expand to call the libatomic implementation. I chose not to do that for consistency with the rest of the atomic builtins. ------------------------------ There are a few places throughout the compiler that handle such atomic builtins and I have not updated to handle floating point atomic builtins. Places like asan, tsan, gimple-ssa-warn-access, analyzer, and tree-ssa-forwprop would need to be updated eventually. However since the current state of GCC is that no backend implements these optabs directly the generic version of the builtin is always expanded as a CAS loop in the frontend -- this means these mid-end passes will not see any of these builtins except in the case that the user explicitly calls the resolved version. I hoping to update these places in a later patch. ------------------------------ N.b. there is a remaining bug where ix86_atomic_assign_expand_fenv generates code that gets broken during optimisation by `fold`. I believe the code returned by this function is incorrect (maybe only for C++?). Hoping someone can confirm that. The expression that gets incorrectly optimised is along the lines of: COMPOUND_EXPR<TARGET_EXPR<var1, some-init>, TARGET_EXPR<var2, expression-with-var1>> and `fold` (which gets called by `cp_fold`) removes the first TARGET_EXPR since it doesn't look like it has side effects (even though the variable it sets is used in the second expression). I trialled setting TREE_SIDE_EFFECTS in the hook and that appeared to work but would appreciate feedback on whether that's the "correct" fix or not (and if so confirmation that all relevant hooks should do something similar). Due to this currently C++ code using this builtin fails on x86_64. libstdc++-v3/testsuite/29_atomics/atomic_float/compare_exchange_padding.cc ------------------------------ N.b. the story around C++ is not quite finished yet. Automatic linking in of libatomic is being worked on and we hope to send something upstream this week. Similarly the target hooks that were written for C seem to be problematic for C++ (as described in the point above). The goal is still to include this feature (after having fixed the problem outlined above) and having automatic linking of libatomic. However, I have a few fallback approaches that I'd like to gather opinions on (with the goal of those fallbacks to upstream the libstdc++ changes that I linked earlier in this message): 1) Disable fenv exception handling in the builtin for C++ code for this release. (libstdc++ does not currently hande fenv exception, so this would not be a regression in that code. I don't think adding exception handling in later would be backwards incompatible). 2) Include the builtin-SFINAE patch and the libstdc++ patch (both already upstreamed) into this release knowing that GCC itself will not use this codepath. ------------------------------ Testing done: Bootstrap and regression test passes (with known exceptions of C++ on x86_64, and libatomic linking in general) on x86_64 and AArch64. Cross compiler regression tests pass on arm-linux. Also tested with the compiler installed to check behaviour when libatomic is in the link path (to check that known failures from libatomic linking go away when it's available). Similarly tested with a dummy implementation of fetch_add as an optab in the AArch64 backend to ensure that codepath also works. ------------------------------ Matthew Malcomson (11): libatomic: Split concept of SUFFIX and SIZE in libatomic libatomic: Add floating point implementations of fetch_{add,sub} c: c++: Define new floating point builtin fetch_add functions builtins: Add FP types for atomic builtin overload resolution c: c++: Expand into CAS loop in frontend builtins: optab: Tie the new atomic builtins to the backend testsuite: Add tests for fp resolutions of __atomic_fetch_add c: c++: flag to disable fetch_op handling fenv exceptions doc: Mention floating point atomic fetch_add etc in docs [Not For Commit] Add demo implementation of one of the operations [Not For Commit] Link libatomic into tests when available gcc/builtin-types.def | 20 + gcc/builtins.cc | 176 ++ gcc/builtins.h | 2 + gcc/c-family/c-common.cc | 205 ++- gcc/c-family/c.opt | 6 + gcc/config/aarch64/aarch64.h | 2 + gcc/config/aarch64/aarch64.opt | 5 + gcc/config/aarch64/atomics.md | 15 + gcc/doc/extend.texi | 10 + gcc/fortran/f95-lang.cc | 5 + gcc/fortran/types.def | 17 + gcc/optabs.cc | 19 + gcc/optabs.def | 6 +- gcc/sync-builtins.def | 40 + .../template/builtin-atomic-overloads.def | 28 +- .../template/builtin-atomic-overloads6.C | 23 +- .../template/builtin-atomic-overloads7.C | 16 +- gcc/testsuite/gcc.dg/atomic-op-fp-convert.c | 6 + gcc/testsuite/gcc.dg/atomic-op-fp-errs.c | 14 + gcc/testsuite/gcc.dg/atomic-op-fp-fenv-flag.c | 44 + .../gcc.dg/atomic-op-fp-resolve-complain.c | 5 + gcc/testsuite/gcc.dg/atomic-op-fp.c | 200 +++ gcc/testsuite/gcc.dg/atomic-op-fpf.c | 200 +++ gcc/testsuite/gcc.dg/atomic-op-fpf128.c | 203 +++ gcc/testsuite/gcc.dg/atomic-op-fpf16.c | 203 +++ gcc/testsuite/gcc.dg/atomic-op-fpf16b.c | 203 +++ gcc/testsuite/gcc.dg/atomic-op-fpf32.c | 203 +++ gcc/testsuite/gcc.dg/atomic-op-fpf32x.c | 203 +++ gcc/testsuite/gcc.dg/atomic-op-fpf64.c | 203 +++ gcc/testsuite/gcc.dg/atomic-op-fpf64x.c | 203 +++ gcc/testsuite/gcc.dg/atomic-op-fpl.c | 200 +++ .../gcc.dg/atomic/atomic-op-fp-fenv.c | 376 +++++ .../gcc.target/i386/excess-precision-13.c | 88 + gcc/testsuite/lib/target-supports.exp | 199 ++- libatomic/Makefile.am | 38 +- libatomic/Makefile.in | 44 +- libatomic/acinclude.m4 | 56 +- libatomic/auto-config.h.in | 114 +- libatomic/cas_n.c | 8 +- libatomic/config/linux/aarch64/host-config.h | 23 +- libatomic/config/linux/arm/host-config.h | 2 +- libatomic/config/s390/cas_n.c | 6 +- libatomic/config/s390/exch_n.c | 4 +- libatomic/config/s390/load_n.c | 4 +- libatomic/config/s390/store_n.c | 4 +- libatomic/config/x86/host-config.h | 14 +- libatomic/configure | 1485 ++++++++++++++++- libatomic/configure.ac | 6 + libatomic/exch_n.c | 12 +- libatomic/fadd_n.c | 23 +- libatomic/fop_n.c | 111 +- libatomic/fsub_n.c | 23 + libatomic/libatomic.map | 44 + libatomic/libatomic_i.h | 186 ++- libatomic/load_n.c | 12 +- libatomic/store_n.c | 12 +- libatomic/tas_n.c | 12 +- libatomic/testsuite/Makefile.in | 1 + .../testsuite/libatomic.c/atomic-op-fp-fenv.c | 421 +++++ .../testsuite/libatomic.c/atomic-op-fp.c | 219 +++ .../testsuite/libatomic.c/atomic-op-fpf.c | 219 +++ .../testsuite/libatomic.c/atomic-op-fpf128.c | 220 +++ .../testsuite/libatomic.c/atomic-op-fpf16.c | 223 +++ .../testsuite/libatomic.c/atomic-op-fpf16b.c | 220 +++ .../testsuite/libatomic.c/atomic-op-fpf32.c | 220 +++ .../testsuite/libatomic.c/atomic-op-fpf32x.c | 220 +++ .../testsuite/libatomic.c/atomic-op-fpf64.c | 220 +++ .../testsuite/libatomic.c/atomic-op-fpf64x.c | 220 +++ .../testsuite/libatomic.c/atomic-op-fpl.c | 219 +++ .../testsuite/29_atomics/atomic_float/1.cc | 1 + .../testsuite/29_atomics/atomic_ref/float.cc | 1 + 71 files changed, 7978 insertions(+), 237 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp-convert.c create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp-errs.c create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp-fenv-flag.c create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp-resolve-complain.c create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp.c create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf.c create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf128.c create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf16.c create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf16b.c create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf32.c create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf32x.c create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf64.c create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf64x.c create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpl.c create mode 100644 gcc/testsuite/gcc.dg/atomic/atomic-op-fp-fenv.c create mode 100644 gcc/testsuite/gcc.target/i386/excess-precision-13.c create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fp-fenv.c create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fp.c create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf.c create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf128.c create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf16.c create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf16b.c create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf32.c create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf32x.c create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf64.c create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf64x.c create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpl.c -- 2.43.0