While analyzing the performance of the "bashmark" memory benchmark, I found a 100x performance regression between GCC 3.4.5 and GCC 4.x.

------ test_cmd.cpp (simplified bashmark memory RW test) -------
#include <stdint.h>
#include <cstring>

template <const uint8_t Block_Size, const uint32_t Loops>
static void int_membench(uint8_t* mb1, uint8_t* mb2)
{
    for(uint32_t i = 0; i < Loops; i+=1)
    {
#define T memcpy(mb1, mb2, Block_Size); memset(mb2, i, Block_Size);
        T T T T T T T T T T
#undef T
    }
}

template <const uint32_t Buf_Size, const uint32_t Loops>
static void membench()
{
    static uint8_t mb1[Buf_Size];
    static uint8_t mb2[Buf_Size];

    for(uint32_t i = 0; i < 10000; i+=1)
        int_membench<Buf_Size, Loops>(mb1, mb2);
}

int main()
{
    membench<128, 4000>();
    return 0;
}
---------------------------------------------------------------

GCC 3.4.5:  0.43user 0.00system 0:00.44elapsed
GCC 4.0.2: 34.83user 0.68system 0:36.09elapsed
GCC 4.1.0: 33.86user 0.58system 0:34.96elapsed

Compiler options:
-march=athlon-xp -O3 -fomit-frame-pointer -mfpmath=sse -msse -ftracer -fweb -maccumulate-outgoing-args -ffast-math

I've tried various settings (-O2, -O1, without -march, without -ftracer and -fweb, etc.) and none of them made a serious difference, i.e. GCC 4 is always many times slower than GCC 3.4.5.

Looking at the generated assembly shows that GCC 4 does not inline the memcpy and memset calls.

------ test.c (uber simplified problem demonstration) ---------
#include <string.h>

char* f(char* b)
{
    static char a[64];
    memcpy(a, b, 64);
    memset(a, 0, 64);
    return a;
}
----------------------------------------------------------------

GCC 4 generates calls to memcpy and memset in this example. GCC 3 inlines all of them.

So it looks like the GCC 4 inliner is broken at some point.
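
To make clear what "inlining" means for the test.c case above, here is a minimal source-level sketch of a fixed-size copy and fill written without any library call. It is only an illustration, not the code either GCC version actually emits. The names BLOCK, copy_block, fill_block and blocks_match are hypothetical and do not come from bashmark; the size 64 is taken from test.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BLOCK = 64 };  /* fixed size known at compile time, as in test.c */

/* Source-level stand-in for an inlined memcpy(dst, src, 64):
   a fixed-count loop with no call into the C library. */
static void copy_block(unsigned char *dst, const unsigned char *src)
{
    size_t i;
    assert(dst != NULL && src != NULL);
    for (i = 0; i < BLOCK; i++)
        dst[i] = src[i];
}

/* Source-level stand-in for an inlined memset(dst, value, 64). */
static void fill_block(unsigned char *dst, unsigned char value)
{
    size_t i;
    assert(dst != NULL);
    for (i = 0; i < BLOCK; i++)
        dst[i] = value;
}

/* Returns 1 if the hand-written versions produce the same bytes
   as the library memcpy/memset, 0 otherwise. */
static int blocks_match(void)
{
    unsigned char src[BLOCK], a[BLOCK], b[BLOCK];
    size_t i;

    for (i = 0; i < BLOCK; i++)
        src[i] = (unsigned char)(i * 7 + 1);

    memcpy(a, src, BLOCK);
    copy_block(b, src);
    if (memcmp(a, b, BLOCK) != 0)
        return 0;

    memset(a, 0x5a, BLOCK);
    fill_block(b, 0x5a);
    return memcmp(a, b, BLOCK) == 0;
}
```

Calling blocks_match() from a small main() only confirms that the two forms are equivalent in behavior. Whether a given compiler expands memcpy/memset inline or emits a call still has to be checked in the assembly output (gcc -S), as was done for this report.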