https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116738
Bug ID: 116738
Summary: Constant folding of _mm_min_ss and _mm_max_ss is wrong
Product: gcc
Version: 14.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: kobalicek.petr at gmail dot com
Target Milestone: ---

GCC incorrectly optimizes x86 intrinsics that have a defined operation at the ISA level. The problem appears when a value is known at compile time: constant folding then uses a different operation than the CPU would when the instruction is actually executed.

Here is the definition of [V]MINSS:

```
MIN(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST := SRC2;
        ELSE IF (SRC1 = NaN) THEN DEST := SRC2; FI;
        ELSE IF (SRC2 = NaN) THEN DEST := SRC2; FI;
        ELSE IF (SRC1 < SRC2) THEN DEST := SRC1;
        ELSE DEST := SRC2;
    FI;
}
```

So it's clear that SRC1 is only selected when the ordered comparison `SRC1 < SRC2` is true. However, GCC doesn't seem to respect this detail. Here is a test case that I was able to craft:

```
// Demonstration of a GCC bug in constant folding of SIMD intrinsics:
#include <x86intrin.h>
#include <numeric>
#include <limits>
#include <stdio.h>

float clamp(float f) {
  __m128 v = _mm_set_ss(f);
  __m128 zero = _mm_setzero_ps();
  __m128 greatest = _mm_set_ss(std::numeric_limits<float>::max());

  v = _mm_min_ss(v, greatest);
  v = _mm_max_ss(v, zero);

  return _mm_cvtss_f32(v);
}

int main() {
  printf("clamp(-0) -> %f\n", clamp(-0.0f));
  printf("clamp(nan) -> %f\n", clamp(std::numeric_limits<float>::quiet_NaN()));
  return 0;
}
```

GCC results (wrong):

```
clamp(-0) -> -0.000000
clamp(nan) -> nan
```

Clang results (expected):

```
clamp(-0) -> 0.000000
clamp(nan) -> 340282346638528859811704183484516925440.000000
```

Here is a Compiler Explorer link:

- https://godbolt.org/z/6afjoaj86

I'm aware this is a possible duplicate of an [UNCONFIRMED] bug:

- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99497

However, fast-math is mentioned in that bug report, and I'm not interested in fast-math at all; I'm not
using that option.

This bug makes it impossible to create test cases for some optimized functions I maintain: the tests fail when compiled with GCC, while other compilers produce correct results. A possible workaround is to use the _ps variants of the intrinsics instead of _ss, but that is also something I'd like to avoid, because in some cases I really am working with a scalar value only.

Interestingly, when the test case is compiled by GCC in debug mode (without optimizations), GCC behaves correctly, so this bug is tied to the optimization pipeline as well. I'm not aware of any UB in this test case.