http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46900
--- Comment #4 from Tobias Burnus <burnus at gcc dot gnu.org> 2010-12-12 10:50:09 UTC ---
(In reply to comment #3)
> (I don't understand why the MATMUL part differs that much - it should call the
> same BLAS function [via the same GCC 4.6 libgfortran.so wrapper] and LTO should
> not affect it.)

Seemingly, LTO is crucial for 4.5 - without LTO dgemm gets slower but the
libgfortran version gets faster:

$ gfortran-4.5 -fexternal-blas -O3 -ffast-math -march=native test.f90 dgemm.f lsame.f xerbla.f && ./a.out
 Time, MATMUL:    1.3200819       53.480084765505403
 dgemm:    1.3120821       56.452265589399069

$ gfortran-4.5 -c -flto -fexternal-blas -O3 -ffast-math -march=native test.f90 dgemm.f lsame.f xerbla.f
$ gfortran-4.5 -flto -O3 -ffast-math -march=native test.o dgemm.o lsame.o xerbla.o
$ ./a.out
 Time, MATMUL:    1.3080810       53.480084765505403
 dgemm:    1.0800680       56.452265589399069

Here, for GCC 4.5, one sees that for the direct call of dgemm, LTO improves the
performance - and doing a single step compilation+linkage or in two steps does
not matter.

However, also for GCC 4.5 the single-step pessimizes the performance of the
libgfortran MATMUL (which is a wrapper for dgemm).
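
To put numbers on the two gfortran-4.5 runs quoted above, here is a minimal sketch that computes the relative run-time reduction of the -flto build over the non-LTO build. The timings are copied from the output above; the dictionary layout and the helper name lto_time_reduction are only illustrative and not part of the test case.

```python
# Timings in seconds, copied from the two gfortran-4.5 runs quoted above.
# "no_lto" is the single-step build, "lto" is the two-step -flto build.
timings = {
    "MATMUL": {"no_lto": 1.3200819, "lto": 1.3080810},
    "dgemm": {"no_lto": 1.3120821, "lto": 1.0800680},
}


def lto_time_reduction(no_lto: float, lto: float) -> float:
    """Return the fraction of run time saved by the LTO build (hypothetical helper)."""
    return (no_lto - lto) / no_lto


# Relative saving per kernel; positive means the LTO build was faster.
reductions = {
    name: lto_time_reduction(t["no_lto"], t["lto"]) for name, t in timings.items()
}

for name, saving in reductions.items():
    print(f"{name}: {saving:.1%} less time with -flto")
```

With these numbers the direct dgemm call takes roughly 17.7% less time with -flto, while the MATMUL path changes by only about 0.9%. These are single runs, so differences of that second size may well be within timing noise.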