https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #10 from Uroš Bizjak <ubizjak at gmail dot com> ---
FYI, the following testcase:

--cut here--
#include <math.h>

float
__attribute__((noinline))
_fmodf (float x, float y)
{
  return x - truncf (x/y) * y;
}

int
main ()
{

  float a, b;
  volatile float z;

  for (a = -1000.0f; a < 1000.0f; a += 0.01f)
    for (b = -1000.0f; b < 1000.0f; b += 0.1f)
      z = fmodf (a, b);

  return 0;
}
--cut here--

$ gcc -Ofast -lm fmod-bench.c

      22,127092116 seconds time elapsed

      22,125111000 seconds user
       0,000999000 seconds sys


$ gcc -Ofast -fno-builtin-fmodf -lm fmod-bench.c

      32,751589079 seconds time elapsed

      32,746156000 seconds user
       0,000999000 seconds sys


This shows that the x87 code is considerably faster than the glibc call on my
target (Ivybridge-E) on Fedora-34 with glibc-2.33.

For reference, when the above _fmodf is called, I get:

$ gcc -Ofast -lm fmod-bench.c

      10,706189749 seconds time elapsed

      10,704859000 seconds user
       0,000999000 seconds sys

$ gcc -Ofast -lm -msse4 fmod-bench.c

      11,391062747 seconds time elapsed

      11,390771000 seconds user
       0,000000000 seconds sys

So, considerably faster!

It looks like with -ffast-math it is not the inlined x87 code that is
problematic, but the missing fmod transformation. As shown above, the SSE2 code
for truncf is on par with the SSE4 roundss instruction, so if the target can
provide optimized truncf code, fmodf should definitely be converted to
"a - trunc (a/p) * p".
