https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #8 from Chris Elrod <elrodc at gmail dot com> ---
Created attachment 45358
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45358&action=edit
gfortran compiled assembly for the tranposed version of the original code.

Here is the assembly for the loop body of the transposed version of the code,
compiled by gfortran:

.L8:
        vmovss  36(%rsi), %xmm0
        addq    $40, %rsi
        vrsqrtss        %xmm0, %xmm2, %xmm2
        addq    $12, %rdi
        vmulss  %xmm0, %xmm2, %xmm0
        vmulss  %xmm2, %xmm0, %xmm0
        vmulss  %xmm7, %xmm2, %xmm2
        vaddss  %xmm8, %xmm0, %xmm0
        vmulss  %xmm2, %xmm0, %xmm0
        vmulss  -8(%rsi), %xmm0, %xmm5
        vmulss  -12(%rsi), %xmm0, %xmm4
        vmulss  -32(%rsi), %xmm0, %xmm0
        vmovaps %xmm5, %xmm3
        vfnmadd213ss    -16(%rsi), %xmm5, %xmm3
        vmovaps %xmm4, %xmm2
        vfnmadd213ss    -20(%rsi), %xmm5, %xmm2
        vmovss  %xmm0, -4(%rdi)
        vrsqrtss        %xmm3, %xmm1, %xmm1
        vmulss  %xmm3, %xmm1, %xmm3
        vmulss  %xmm1, %xmm3, %xmm3
        vmulss  %xmm7, %xmm1, %xmm1
        vaddss  %xmm8, %xmm3, %xmm3
        vmulss  %xmm1, %xmm3, %xmm3
        vmulss  %xmm3, %xmm2, %xmm6
        vmovaps %xmm4, %xmm2
        vfnmadd213ss    -24(%rsi), %xmm4, %xmm2
        vfnmadd231ss    %xmm6, %xmm6, %xmm2
        vrsqrtss        %xmm2, %xmm10, %xmm10
        vmulss  %xmm2, %xmm10, %xmm1
        vmulss  %xmm10, %xmm1, %xmm1
        vmulss  %xmm7, %xmm10, %xmm10
        vaddss  %xmm8, %xmm1, %xmm1
        vmulss  %xmm10, %xmm1, %xmm1
        vmulss  %xmm1, %xmm3, %xmm2
        vmulss  %xmm6, %xmm2, %xmm2
        vmovss  -36(%rsi), %xmm6
        vxorps  %xmm9, %xmm2, %xmm2
        vmulss  %xmm6, %xmm2, %xmm10
        vmulss  %xmm2, %xmm5, %xmm2
        vfmadd231ss     -40(%rsi), %xmm1, %xmm10
        vfmadd132ss     %xmm4, %xmm2, %xmm1
        vfnmadd132ss    %xmm0, %xmm10, %xmm1
        vmulss  %xmm0, %xmm5, %xmm0
        vmovss  %xmm1, -12(%rdi)
        vsubss  %xmm0, %xmm6, %xmm0
        vmulss  %xmm3, %xmm0, %xmm3
        vmovss  %xmm3, -8(%rdi)
        cmpq    %rsi, %rax
        jne     .L8

While Flang had a second loop of scalar code (to catch the N mod [SIMD vector width] remainder of the vectorized loop), there are no secondary loops in the gfortran code, meaning these must all be scalar operations (I have a hard time telling apart SSE from scalar code...).
In the operations it performs, it looks similar to Flang's vectorized loop, except that it operates on only a single number at a time. That is because efficient vectorization needs corresponding elements to be contiguous (i.e., all the input1s together, all the input2s together). There is no benefit from having the different elements with the same index (the first input1 next to the first input2, next to the first input3...) be contiguous. The memory layout I used in the original code is performance-optimal, but unfortunately it is something gfortran often cannot handle automatically (without manual unrolling). This is why I filed a report on bugzilla.
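
To make the two layouts concrete, here is a minimal C sketch. It is not the code from the report: the names (RecordAoS, kernel_aos, kernel_soa, stride_aos, stride_soa) are hypothetical, the kernel is a stand-in multiplication rather than the real computation, and the sizes assume 10 single-precision inputs and 3 outputs per element, which is what the `addq $40, %rsi` and `addq $12, %rdi` increments in the loop suggest.

```c
#include <stddef.h>

/* Assumed sizes: 40-byte input stride / 4-byte float = 10 inputs. */
enum { N_IN = 10, N_OUT = 3 };

/* "Transposed" layout: the N_IN inputs of ONE element are adjacent. */
typedef struct { float v[N_IN]; } RecordAoS;

/* Byte distance between the same input of two consecutive elements. */
size_t stride_aos(void) { return sizeof(RecordAoS); } /* 40 if float is 4 bytes */
size_t stride_soa(void) { return sizeof(float); }     /* 4 if float is 4 bytes */

/* Stand-in kernel on the transposed layout: each iteration reads two
 * values that are 40 bytes away from the ones the next iteration reads,
 * so filling a vector register would need strided loads or gathers. */
void kernel_aos(const RecordAoS *in, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i].v[0] * in[i].v[9];
}

/* Same stand-in kernel with each input stored contiguously:
 * in[k * n + i] holds input k of element i. Consecutive iterations read
 * consecutive floats, so one vector load can fetch several elements. */
void kernel_soa(const float *in, float *out, size_t n) {
    const float *in0 = in;         /* all the input-0 values */
    const float *in9 = in + 9 * n; /* all the input-9 values */
    for (size_t i = 0; i < n; ++i)
        out[i] = in0[i] * in9[i];
}
```

Both functions compute the same results; only the distance in memory between the values used by iteration i and iteration i+1 differs. Whether a given compiler actually vectorizes either loop depends on its version and flags, so inspecting the generated assembly, as above, remains the way to check.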