Hi!

Per <https://gcc.gnu.org/PR98321> we've found that GCC's nvptx back end doesn't make use of the PTX 32-bit floating-point atomic add instruction ('atom.add.f32'), <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-atom>, as declared in 'gcc/config/nvptx/nvptx.md:atomic_fetch_addsf'.
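
To make concrete which operation is meant: in C terms, a floating-point atomic fetch-and-add atomically does '*addr += val' and returns the previous value. Below is a minimal sketch of those semantics, written as a compare-and-swap loop on top of GCC's generic '__atomic_load' and '__atomic_compare_exchange' builtins. The helper name 'fetch_add_float' is made up for illustration, and this is not meant to suggest how the nvptx back end would or should expand the operation.

```c
/* Hypothetical helper, not a GCC builtin: atomically performs
   '*addr += val' and returns the value '*addr' held before.
   Relaxed memory order throughout, as in the example discussed here.  */
static float
fetch_add_float (float *addr, float val)
{
  float old_val, new_val;

  /* Initial snapshot of the current contents.  */
  __atomic_load (addr, &old_val, __ATOMIC_RELAXED);

  do
    /* Compute the candidate result from the latest snapshot.  */
    new_val = old_val + val;
  /* Store 'new_val' only if '*addr' still holds 'old_val' (compared
     bitwise); on failure 'old_val' is refreshed and we retry.  */
  while (!__atomic_compare_exchange (addr, &old_val, &new_val, 1 /* weak */,
                                     __ATOMIC_RELAXED, __ATOMIC_RELAXED));

  return old_val;
}
```

With 'float a = 1.5f;', calling 'fetch_add_float (&a, 2.25f)' returns 1.5f and leaves 3.75f in 'a'. A single hardware instruction would avoid the retry loop, which is the point of the PR.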

Trying '__atomic_fetch_add (&a, b, __ATOMIC_RELAXED);' (also outside of the nvptx target context), I found that this works fine for integer types, but for 'float' it emits "error: operand type ‘float *’ is incompatible with argument 1 of ‘__atomic_fetch_add’".

In 'gcc/doc/extend.texi' we read for '__atomic_fetch_add' (by reference to '__atomic_add_fetch') that "The object pointed to by the first argument must be of integer or pointer type", which explains where the error diagnostic is coming from. That's 'gcc/c-family/c-common.c:sync_resolve_size', and it dates back to 2005: PR14311 "builtins for atomic operations needed", commit r98154 (Git commit 48ae6c138ca30c4c5e876a0be47c9a0b5c8bf5c2). (I haven't studied that in detail yet.)

But: why, what's the rationale? Are potential floating-point exceptions the problem, maybe? "Simply because nobody bothered implementing it (common hardware doesn't provide it)" certainly is fine as an answer -- I just want to make sure I'm not missing some "obvious detail".

As far as that's relevant here, I do understand that floating-point 'a + (b + c)' may be different from '(a + b) + c' etc.; '-ffast-math' doesn't seem to help here.

(I have not yet worked out how to adequately model in GCC the PTX ISA's statement that the "Current implementation of 'atom.add.f32' on global memory flushes subnormal inputs and results to sign-preserving zero; whereas 'atom.add.f32' on shared memory supports subnormal inputs and results and doesn't flush them to zero".)


Regards
 Thomas
-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München
Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf