http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741
--- Comment #26 from Evgeniy Dushistov <dushistov at mail dot ru> ---
I tried this simple C++ function, compiled into a separate object file
(-march=native -Ofast):

void mult(const double * const __restrict__ A, const double * const __restrict__ B, double * const __restrict__ C, const size_t N)
{
    for (size_t j = 0; j < N; ++j)
        for (size_t i = 0; i < N; ++i)
            for (size_t k = 0; k < N; ++k)
                C[i * N + j] += A[i * N + k] + B[k * N + j];
}

$ time ./test_gcc
204.800000

real    0m9.628s
user    0m9.620s
sys     0m0.000s

$ time ./test_icc
204.800000

real    0m0.637s
user    0m0.630s
sys     0m0.000s

That is a difference of about 15 times.

It looks like the difference is here:

GCC:
Analyzing loop at mult.cpp:5
Analyzing loop at mult.cpp:6
Analyzing loop at mult.cpp:7
mult.cpp:3: note: vectorized 0 loops in function.

ICC:
mult.cpp(5): (col. 2) remark: PERMUTED LOOP WAS VECTORIZED.
mult.cpp(5): (col. 2) remark: PERMUTED LOOP WAS VECTORIZED.
mult.cpp(5): (col. 2) remark: PERMUTED LOOP WAS VECTORIZED.
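
To illustrate what "PERMUTED LOOP" probably refers to, here is a rough sketch of the same function with the loops interchanged by hand. The ICC log does not say which order it picked, so the i, k, j order below is an assumption, and the names mult_original, mult_interchanged and same_result are made up for this example. With j innermost, C and B are both accessed with unit stride and the A element is loop-invariant, which is the usual shape a vectorizer wants. In the original order the innermost k loop reads B with stride N.

```cpp
#include <cstddef>
#include <vector>

// Loop order from the report: j, i, k.
// The innermost loop reads B[k * N + j] with stride N.
void mult_original(const double* __restrict__ A, const double* __restrict__ B,
                   double* __restrict__ C, std::size_t N) {
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t k = 0; k < N; ++k)
                C[i * N + j] += A[i * N + k] + B[k * N + j];
}

// Hand-interchanged order (assumed, not taken from the ICC log): i, k, j.
// The innermost loop now touches C and B with unit stride, and A[i * N + k]
// is invariant inside it.
void mult_interchanged(const double* __restrict__ A, const double* __restrict__ B,
                       double* __restrict__ C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k) {
            const double a = A[i * N + k];
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += a + B[k * N + j];
        }
}

// Checks that both loop orders give the same C for an N x N input.
// For every (i, j) both versions add the k terms in ascending k order,
// so the results match exactly when compiled without fast-math.
bool same_result(std::size_t N) {
    std::vector<double> A(N * N), B(N * N), C1(N * N, 0.0), C2(N * N, 0.0);
    for (std::size_t t = 0; t < N * N; ++t) {
        A[t] = static_cast<double>(t % 7);
        B[t] = static_cast<double>(t % 5);
    }
    mult_original(A.data(), B.data(), C1.data(), N);
    mult_interchanged(A.data(), B.data(), C2.data(), N);
    return C1 == C2;
}
```

Whether GCC vectorizes the interchanged version would still need to be checked against the vectorizer dump for the compiler version in question; this only shows the transformation, not a measured result.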