>For GCC, I used in both cases the flags -march=pentium4 -mfpmath=sse -O3 -fomit-frame-pointer -ffast-math
As for gcc4 vs gcc3.4, the degradation on the x86 architecture is most probably due to the higher register pressure created by the more aggressive SSA optimizations in gcc4.
Try these five combinations:
-O2 -fomit-frame-pointer -ffast-math
-O2 -fomit-frame-pointer -ffast-math -fno-tree-pre
-O2 -fomit-frame-pointer -ffast-math -fno-tree-pre -fno-gcse
-O3 -fomit-frame-pointer -ffast-math -fno-tree-pre
-O3 -fomit-frame-pointer -ffast-math -fno-tree-pre -fno-gcse
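If it saves some typing, the five combinations above can be turned into compiler command lines with a small script. This is only a sketch: the source file name bench.c, the output names bench_1 to bench_5, the compiler name gcc and the extra_flags parameter are placeholders of mine, not anything from your setup. It only prints the commands and does not run them.

```python
import shlex

# Flags shared by all five combinations listed above.
COMMON = ["-fomit-frame-pointer", "-ffast-math"]

# The five combinations, in the same order as in the list above.
COMBINATIONS = [
    ["-O2"] + COMMON,
    ["-O2"] + COMMON + ["-fno-tree-pre"],
    ["-O2"] + COMMON + ["-fno-tree-pre", "-fno-gcse"],
    ["-O3"] + COMMON + ["-fno-tree-pre"],
    ["-O3"] + COMMON + ["-fno-tree-pre", "-fno-gcse"],
]


def build_commands(source="bench.c", compiler="gcc", extra_flags=()):
    """Return one compiler argument list per flag combination.

    source, compiler and extra_flags are hypothetical placeholders;
    adjust them to the real benchmark and any flags you always pass.
    """
    commands = []
    for index, flags in enumerate(COMBINATIONS, start=1):
        output = f"bench_{index}"  # hypothetical output name, one per build
        commands.append([compiler, *flags, *extra_flags, source, "-o", output])
    return commands


if __name__ == "__main__":
    # Dry run: print each command line instead of executing it.
    for command in build_commands():
        print(shlex.join(command))
```

Passing something like extra_flags=["-march=pentium4", "-mfpmath=sse"] keeps the architecture flags from your original command line identical across all five builds, so only the optimization flags vary.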
You may also want to try -mfpmath=sse,387 in case your benchmarks use sin, cos and other transcendental functions that GCC knows about when using 387 instructions.
Paolo