Re: GCC and Floating-Point

Uros Bizjak Tue, 24 May 2005 00:05:07 -0700

Hello!

> I'm writing an extensive article about floating-point programming on
> Linux, including a focus on GCC's compilers. This is an outgrowth of
> many debates about topics like -ffast-math and -mfpmath=sse|387, and I
> hope it will prove enlightening for myself and others.


  I would like to point out that for applications that crunch data from real
world (no infinites or nans, where precision is not critical) such as various
simulations, -ffast-math is something that can speed up application a lot.

Regarding i387, -ffast-math does following:

- because NaNs and infinets are not present, fp compares are simplified, as
there is no need for bypases and secondary compares (due to C0 C2 and C3 bits of
FP status word) and cmove instruction can be used in all cases.

- simple builtin functions (sqrt, sin and cos, etc) can be used with hardware
implemented fsqrt, fsin and fcos insn instead of calling library functions.
These can handle arguments up to (something)*2^63, so it is quite enough for
normal apps. This way, a call overhead (saving all FP registers, overhead of
call insn itself) is eliminated, and as an added bonus, sin and cos of the same
argument are combined as fsincos insn.

- not-so-simple (I won't use the word complex here :)) builtin functions (exp,
asin, etc) are expanded on RTL level and CSE is used to eliminate duplicate
calculations.

- floor and ceil functions are implemented as builtin functions and further 
simplified to direct conversion, for example:
(int)floor(double) -> __builtin_lfloor(double).
__builtin_lfloor (and similar builtins) can be implemented directly in i387
using fist(p) insn with appropriate rounding control bits set in control word.

- in addition to this target specific effects, various (otherwise unsafe)
transformations are enabled on middle-level when -ffast-math is used.

  Due to outdated i386 ABI, where all FP parameters are passed on stack, SSE
code does not show all its power when used. When math library function is
called, SSE regs are pushed on stack and called math library function (that is
currently implemented again with i387 insns) pulls these values from stack to
x87 registers. In contrast, x86_64 ABI specifies that FP values are passed in
SSE registers, so they avoid costly SSE reg->stack moves. Until i386 ABI
(together with supporting math functions) is changed to something similar to
x86_64, use of -mfpmath=sse won't show all its power. Another fact is, that x87
intrinsics are currently disabled for -mfpmath=sse, because it was shown that
SSE math libraries (with SSE ABI) are faster for x86_64. Somehow annoying fact
is, that intrinsics are disabled also for i386, where we are still waiting for
ABI to change ;) [Please note that use of SSE intrinsic functions does not rely
on -mfpmath=... settings].

  So, for real-world applications, using i387 with -ffast-math could be
substantially faster than using SSE code. However, the problem lies in math
library headers. These define a lot of inlined asm functions in mathinline.h
header (included when math.h is used). These functions interfear with gcc's
builtins, so -D__NO_MATH_INLINES is needed to fix this problem. The situation is
even worse when SSE code is used. Asm inlines from mathinline.h are implemented
using i387 instructions, so these instructions force parameters to move from SSE
registers to x87 regs (via stack) and the result to move back to SSE reg the
same way. This can be seen when sin(a + 1.0) is compiled with math.h header
included. With -mfpmath=sse, SSE->mem->FP reg moves are needed to satisfy
constraints of inlined sin() code.

  Because SSE->x87 moves are costly, -mfpmath=sse,387 produces unoptimal code.
This option in fact confuses register allocator, and wrong register set is
choosen sometimes. As there is no separate resources for SSE and x87
instructions, the use of -mfpmath=sse,387 is a bit questionable. However, with
-mfpmath=sse, x87 registers could be used for temporary storage in MEM->MEM
moves, they can even do conversions from (double)<->(float) on the fly...

  As an example for this writing, try to benchmark povray with the combination
of following parameters:

-ffast-math
-mpfmath=sse, -mfpmath=387 (and perhaps -mfpmath=387,sse)
-D__NO_MATH_INLINES (this depends on the version of your libc)

Uros.

Re: GCC and Floating-Point

Reply via email to