https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008
--- Comment #10 from Uroš Bizjak <ubizjak at gmail dot com> ---
FYI, the following testcase:

--cut here--
#include <math.h>

float
__attribute__((noinline))
_fmodf (float x, float y)
{
  return x - truncf (x/y) * y;
}

int
main ()
{
  float a, b;
  volatile float z;

  for (a = -1000.0f; a < 1000.0f; a += 0.01f)
    for (b = -1000.0f; b < 1000.0f; b += 0.1f)
      z = fmodf (a, b);

  return 0;
}
--cut here--

$ gcc -Ofast -lm fmod-bench.c

      22,127092116 seconds time elapsed

      22,125111000 seconds user
       0,000999000 seconds sys

$ gcc -Ofast -fno-builtin-fmodf -lm fmod-bench.c

      32,751589079 seconds time elapsed

      32,746156000 seconds user
       0,000999000 seconds sys

This indicates that the x87 code is considerably faster than the library
call on my target (Ivybridge-E) on Fedora-34 with glibc-2.33.

For reference, when the above _fmodf is called instead, I get:

$ gcc -Ofast -lm fmod-bench.c

      10,706189749 seconds time elapsed

      10,704859000 seconds user
       0,000999000 seconds sys

$ gcc -Ofast -lm -msse4 fmod-bench.c

      11,391062747 seconds time elapsed

      11,390771000 seconds user
       0,000000000 seconds sys

So, considerably faster!

It looks like, with -ffast-math, it is not the inlined x87 code that is
problematic, but the missing fmod transformation. As shown above, the SSE2
code for truncf is on par with the SSE4 roundss instruction, so if the target
can provide optimized truncf code, fmodf should definitely be converted to
"a - trunc (a/p) * p".