> [...]
> I have made a few optimized functions myself and published them as a
> multi-platform library (www.agner.org/optimize/asmlib.zip). It is
> faster than most other libraries on an Intel Core2 and up to ten
> times faster than gcc using builtin functions. My library is
> published with GPL license, but I will allow you to use my code in
> gnu libc if you wish (Sorry, I don't have the time to work on the gnu
> project myself, but you may contact me for details about the code).
> [...]

But then it's not gcc that is the best optimising compiler; it's the library that is the best, *hand optimised so that gcc compiles it very well*.

Here's an example:

    void foo( void )
    {
        unsigned x;

        for ( x = 0 ; x < 200 ; x++ )
            func();
    }

    void bar( void )
    {
        unsigned x;

        for ( x = 201 ; --x ; )
            func();
    }

foo() and bar() are completely equivalent: they call func() 200 times and that's all. Yet, if you compile them with -O3 for the arm-elf target with version 4.0.2 (yes, I know, it's an ancient version, but still), bar() will be 6 insns long, with the loop itself being 3, while foo() compiles to 7 insns, of which 4 are the loop.

In fact, the compiler is clever enough to transform bar()'s loop from

    for ( x = 201 ; --x ; )
        func();

to

    x = 200;
    do
        func();
    while ( --x );

internally. The latter form is shorter to evaluate, and since x is not used other than as the loop counter, the change doesn't matter.

However, it is not clever enough to figure out that foo()'s loop is doing exactly what bar()'s is doing. Since x is only the loop counter, gcc could freely transform foo()'s loop to bar()'s, but it doesn't. It generates the equivalent of this:

    x = 0;
    do {
        x += 1;
        func();
    } while ( x != 200 );

which is not as efficient as what it generates from bar()'s code.

Of course, you get a surprise when you change -O3 to -Os, in which case gcc suddenly realises that foo() can indeed be transformed to the internal representation that it used for bar() with -O3. Thus, foo() is now only 6 insns long with a 3 insn loop. Unfortunately, bar() is not that lucky.
Although its loop remains 3 insns long, the entire function grows by an additional instruction, for bar() internally now looks like this:

    x = 201;
    goto label;
    do {
        func();
    label: ;
    } while ( --x );

You can play with gcc and see which of the equivalent C constructs it compiles to better code at any particular -O level (and if you have to work with severely constrained embedded systems, you often do), but hand-crafting your C code to fit gcc's taste is actually not that good an idea. With the next release, when different constructs are recognised, you may end up with larger and/or slower code (as happened to me when changing 4.0.x -> 4.3.x, and before that when going from 2.9.x to 3.1.x).

Gcc will be the best optimising compiler when it generates faster/shorter code than the other compilers on the majority of a large set of arbitrary, *not* hand-optimised sources. Preferably for most targets, not only for the x86, if possible :-)

Zoltan
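
P.S. In case anyone wants to check the claim that all of these loop forms really are equivalent, here is a minimal sketch in plain C. It only counts how many times func() gets called; it says nothing about the code gcc generates for any target. The names calls, count_calls(), check_equivalence() and the *_o3 / *_os variants are made up for this illustration, and the variants are hand-written approximations of the internal forms described above, not actual compiler output.

```c
#include <assert.h>

static unsigned calls;              /* how many times func() has run */

void func(void) { calls++; }

/* The two loops from the example above. */
void foo(void) { unsigned x; for (x = 0; x < 200; x++) func(); }
void bar(void) { unsigned x; for (x = 201; --x; ) func(); }

/* Hand-written versions of the forms gcc is described as using internally. */
void bar_o3(void) { unsigned x = 200; do func(); while (--x); }
void foo_o3(void) { unsigned x = 0; do { x += 1; func(); } while (x != 200); }
void bar_os(void)
{
    unsigned x = 201;
    goto label;                     /* enter the loop at its test */
    do {
        func();
label:  ;
    } while (--x);
}

/* Hypothetical helper: run one variant and return the number of func() calls. */
unsigned count_calls(void (*loop)(void))
{
    calls = 0;
    loop();
    return calls;
}

/* Aborts via assert() if any variant does not call func() exactly 200 times. */
void check_equivalence(void)
{
    assert(count_calls(foo)    == 200);
    assert(count_calls(bar)    == 200);
    assert(count_calls(bar_o3) == 200);
    assert(count_calls(foo_o3) == 200);
    assert(count_calls(bar_os) == 200);
}
```

Call check_equivalence() from a main() of your own. To compare the generated code for your compiler and target, compile foo() and bar() with -S at -O3 and at -Os and read the assembly; the instruction counts will probably differ from the ones I quoted for 4.0.2.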