Hello!

> I'm writing an extensive article about floating-point programming on
> Linux, including a focus on GCC's compilers. This is an outgrowth of
> many debates about topics like -ffast-math and -mfpmath=sse|387, and I
> hope it will prove enlightening for myself and others.

I would like to point out that for applications that crunch data from
the real world (no infinities or NaNs, and where precision is not
critical), such as various simulations, -ffast-math can speed up the
application a lot. Regarding i387, -ffast-math does the following:

- Because NaNs and infinities are not present, FP compares are
  simplified: there is no need for bypasses and secondary compares (due
  to the C0, C2 and C3 bits of the FP status word), and the cmove
  instruction can be used in all cases.

- Simple builtin functions (sqrt, sin, cos, etc.) can be implemented
  with the hardware fsqrt, fsin and fcos insns instead of calling
  library functions. These can handle arguments up to (something)*2^63,
  so this is quite enough for normal apps. This way the call overhead
  (saving all FP registers, the overhead of the call insn itself) is
  eliminated, and as an added bonus, sin and cos of the same argument
  are combined into a single fsincos insn.

- Not-so-simple (I won't use the word complex here :)) builtin
  functions (exp, asin, etc.) are expanded at the RTL level, and CSE is
  used to eliminate duplicate calculations.

- floor and ceil functions are implemented as builtin functions and
  further simplified to a direct conversion, for example:
  (int)floor(double) -> __builtin_lfloor(double). __builtin_lfloor (and
  similar builtins) can be implemented directly on i387 using the
  fist(p) insn with the appropriate rounding control bits set in the
  control word.

- In addition to these target-specific effects, various (otherwise
  unsafe) transformations are enabled in the middle end when
  -ffast-math is used.

Due to the outdated i386 ABI, where all FP parameters are passed on the
stack, SSE code does not show all its power. When a math library
function is called, the SSE regs are pushed on the stack, and the called
function (which is currently implemented with i387 insns anyway) pulls
these values from the stack into x87 registers. In contrast, the x86_64
ABI specifies that FP values are passed in SSE registers, so x86_64
code avoids the costly SSE reg->stack moves.

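To make the first bullet above a bit more concrete, here is a minimal
sketch of the kind of code where the no-NaN assumption matters. The
function names are hypothetical (not taken from GCC or povray), and the
-m32 and -march=i686 flags in the comment are my assumptions for getting
a 32-bit x87 target with cmove available. The exact code generated will
depend on your GCC version.

```c
/* Compare the assembly with and without -ffast-math, for example
 * (flags are an assumption, adjust for your setup):
 *   gcc -O2 -m32 -march=i686 -mfpmath=387 -S compare.c
 *   gcc -O2 -m32 -march=i686 -mfpmath=387 -ffast-math -S compare.c
 */

/* Hypothetical helper: returns the larger of two doubles with a plain
 * compare-and-select. */
double pick_larger(double a, double b)
{
    /* If either operand is a NaN, the comparison is "unordered" and
     * strict IEEE code has to handle that case. Under -ffast-math the
     * compiler may assume it never happens. */
    return (a > b) ? a : b;
}

/* Classic NaN self-test. Because -ffast-math implies
 * -ffinite-math-only, GCC is allowed to fold this to 0, so do not rely
 * on it in fast-math code. */
int looks_like_nan(double x)
{
    return x != x;
}
```

For ordinary finite inputs both functions behave the same with or
without -ffast-math; only the handling of NaNs (and the generated
compare sequence) differs.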
Until the i386 ABI (together with the supporting math functions) is
changed to something similar to x86_64, -mfpmath=sse won't show all its
power. Another fact is that x87 intrinsics are currently disabled for
-mfpmath=sse, because it was shown that SSE math libraries (with the
SSE ABI) are faster on x86_64. A somewhat annoying fact is that the
intrinsics are also disabled for i386, where we are still waiting for
the ABI to change ;) [Please note that the use of SSE intrinsic
functions does not rely on the -mfpmath=... setting.]

So, for real-world applications, using i387 with -ffast-math could be
substantially faster than using SSE code. However, the problem lies in
the math library headers. These define a lot of inlined asm functions
in the mathinline.h header (included when math.h is used). These
functions interfere with gcc's builtins, so -D__NO_MATH_INLINES is
needed to fix this problem.

The situation is even worse when SSE code is used. The asm inlines from
mathinline.h are implemented using i387 instructions, so they force
parameters to move from SSE registers to x87 regs (via the stack) and
the result to move back to an SSE reg the same way. This can be seen
when sin(a + 1.0) is compiled with the math.h header included. With
-mfpmath=sse, SSE->mem->FP reg moves are needed to satisfy the
constraints of the inlined sin() code.

Because SSE->x87 moves are costly, -mfpmath=sse,387 produces suboptimal
code. This option in fact confuses the register allocator, and the wrong
register set is sometimes chosen. As there are no separate resources for
SSE and x87 instructions, the use of -mfpmath=sse,387 is a bit
questionable. However, with -mfpmath=sse, x87 registers could be used
for temporary storage in MEM->MEM moves; they can even do
(double)<->(float) conversions on the fly...

As an example for this writing, try to benchmark povray with
combinations of the following parameters:

- -ffast-math
- -mfpmath=sse, -mfpmath=387 (and perhaps -mfpmath=387,sse)
- -D__NO_MATH_INLINES (this depends on the version of your libc)

Uros.