https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930
--- Comment #7 from Adam Hirst <adam at aphirst dot karoo.co.uk> ---
OK, I tried a little harder, and was able to get a performance increase:

  type(Vect3D) pure function TP_LEFT(NU, D, NV) result(tensorproduct)
    real(dp),     intent(in) :: NU(4), NV(4)
    type(Vect3D), intent(in) :: D(4,4)
    real(dp) :: Dx(4,4), Dy(4,4), Dz(4,4), NUDx(4), NUDy(4), NUDz(4)

    Dx = D%x
    Dy = D%y
    Dz = D%z
    NUDx = matmul(NU, Dx)
    NUDy = matmul(NU, Dy)
    NUDz = matmul(NU, Dz)
    tensorproduct%x = dot_product(NUDx, NV)
    tensorproduct%y = dot_product(NUDy, NV)
    tensorproduct%z = dot_product(NUDz, NV)
  end function

With this version (still using -Ofast), the matmul path sped up by a factor of about 6 on my machine, which on its own would have made it faster than the "explicit DO" approach. However, the DO version also gained a huge speed-up under -Ofast, so the net result is that matmul here is about half as fast as the explicit loop.

But here is where things get really interesting. If I also use -flto on this post's matmul code path, the matmul implementation becomes twice as fast as the (already now VERY fast) DO implementation. This huge boost doesn't seem to apply to the version of TP_LEFT from my previous post, nor to the original TP_LEFT from the initial ticket submission.

In conclusion: it seems that your remark about matmul inlining also applies to dot_product.

NOTE: For the -flto tests, gcc is clever enough to realise that we're not actually using these results, so I have to save tp(1:i_max) and have the user specify an element to print, in order to force the computation. I of course put those accesses "outside" each pair of cpu_time calls.

As an aside, I also tried the effect of -fexpensive-optimizations, but it did more or less nothing.

---

By the way, are there any thoughts yet on the random number calls taking /longer/ once optimisations are enabled? If I'm reading my results right, -flto seems to "fix" that, but it doesn't seem obvious why it should be occurring in the first place.
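For clarity, here is a minimal sketch of the kind of timing harness the NOTE above describes. The names tp, i_max, and the stand-in workload are illustrative, not the exact code from my driver; the point is only the shape: a run-time-chosen element is printed after the timed region, so the compiler cannot prove the results unused and delete the loop under -flto.

```fortran
program timing_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i_max = 1000000
  real(dp) :: tp(i_max), t0, t1
  integer  :: i, which

  call cpu_time(t0)
  do i = 1, i_max
    tp(i) = sqrt(real(i, dp))   ! stand-in for the TP_LEFT call being timed
  end do
  call cpu_time(t1)

  ! Ask the user for an index and print that element. Because the index
  ! is only known at run time, the whole tp array is live and the loop
  ! cannot be eliminated as dead code. This access sits outside the
  ! cpu_time pair, so it does not pollute the measurement.
  read (*, *) which
  print *, 'tp(', which, ') =', tp(which)
  print *, 'elapsed:', t1 - t0, 's'
end program timing_sketch
```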