https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89371
--- Comment #3 from Arnaud Desitter <arnaud02 at users dot sourceforge.net> ---
Considering:

#include <vector>
#include <iostream>
#include <numeric>

void ff(double* res, double const* a, double const* b, int n1, int n2)
{
  #pragma omp simd collapse(2)
  for(int i1=0; i1 < n1; ++i1) {
    for(int i2=0; i2 < n2; ++i2) {
      res[i1*n2+i2] = a[i1*n2+i2]-b[i1*n2+i2];
    }
  }
}

int main() {
  const auto repeat = 100*100;
  const std::size_t n1 = 100*1000;
  const std::size_t n2 = 3;
  std::vector<double> res(n1*n2), a(n1*n2), b(n1*n2);
  std::iota(a.begin(), a.end(), 1.0);
  std::iota(b.begin(), b.end(), -200.0);
  for(int r=repeat; r>0; --r)
    ff(res.data(), a.data(), b.data(), n1, n2);
  std::cout << res[0] << '\n';
}

Using clang 8.0:
>clang++ -O3 main2.cpp
>/usr/bin/time ./a.out > /dev/null
2.93user 0.00system 0:02.94elapsed 99%CPU (0avgtext+0avgdata 8424maxresident)k
>clang++ -fopenmp-simd -O3 main2.cpp > /dev/null
>/usr/bin/time ./a.out > /dev/null
2.83user 0.00system 0:02.83elapsed 99%CPU (0avgtext+0avgdata 8492maxresident)k
0inputs+0outputs (0major+2215minor)pagefaults 0swaps

Using gcc 9.1.0:
>g++ -O3 main2.cpp
>/usr/bin/time ./a.out > /dev/null
3.49user 0.00system 0:03.50elapsed 99%CPU (0avgtext+0avgdata 8488maxresident)k
0inputs+0outputs (0major+2215minor)pagefaults 0swaps
>g++ -fopenmp-simd -O3 main2.cpp
>/usr/bin/time ./a.out > /dev/null
5.83user 0.00system 0:05.84elapsed 99%CPU (0avgtext+0avgdata 8492maxresident)k
0inputs+0outputs (0major+2215minor)pagefaults 0swaps

clang 8.0 is able to produce vectorised code using "#pragma omp simd
collapse(2)" whereas gcc 9.1.0 cannot. With gcc 9.1.0, adding -fopenmp-simd
makes this example slower (5.83s vs 3.49s user time), while with clang 8.0 the
two timings are close (2.83s vs 2.93s).

For the record, clang 7.0 produces terrible code for this example.
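
For comparison, here is a sketch of what collapse(2) amounts to for this particular loop nest. Because the index i1*n2+i2 walks the arrays contiguously, the two loops can be flattened by hand into a single loop over n1*n2 elements. The names ff_collapse2, ff_flat and same_result are hypothetical and not part of the report. This variant was not timed with either compiler, so whether it avoids the gcc slowdown is an assumption that would need to be checked.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// The loop nest from the report, unchanged apart from the name.
void ff_collapse2(double* res, double const* a, double const* b, int n1, int n2)
{
  #pragma omp simd collapse(2)
  for (int i1 = 0; i1 < n1; ++i1) {
    for (int i2 = 0; i2 < n2; ++i2) {
      res[i1*n2+i2] = a[i1*n2+i2] - b[i1*n2+i2];
    }
  }
}

// Hypothetical hand-flattened variant: one loop over all n1*n2 elements.
// Valid here only because i1*n2+i2 covers 0 .. n1*n2-1 exactly once, in order.
void ff_flat(double* res, double const* a, double const* b, int n1, int n2)
{
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(n1) * n2;
  #pragma omp simd
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    res[i] = a[i] - b[i];
  }
}

// Runs both variants on the same inputs as the report's main() (scaled by
// n1, n2) and reports whether the results are element-wise identical.
bool same_result(int n1, int n2)
{
  assert(n1 >= 0 && n2 >= 0);
  const std::size_t n = static_cast<std::size_t>(n1) * n2;
  std::vector<double> a(n), b(n), r1(n), r2(n);
  std::iota(a.begin(), a.end(), 1.0);
  std::iota(b.begin(), b.end(), -200.0);
  ff_collapse2(r1.data(), a.data(), b.data(), n1, n2);
  ff_flat(r2.data(), a.data(), b.data(), n1, n2);
  return r1 == r2;
}
```

Calling same_result(100, 3) only checks that the two forms compute the same values. It says nothing about the generated code, which would have to be inspected separately for each compiler.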