https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116128
--- Comment #3 from mjr19 at cam dot ac.uk ---
It seems that most of these are inline expanded by gfortran-14.1, at least
in some cases.

function foo(a,n)
  real(kind(1d0))::a(*),foo
  integer::n
  foo=sum(a(1:n))
end function foo

and

function foo(a)
  real(kind(1d0)), contiguous::a(:)
  real(kind(1d0)):: foo
  foo=sum(a)
end function foo

are both inline expanded, and with gfortran-14.1 -O3 -mavx2 -S give

.L5:
        vaddsd  (%rax), %xmm0, %xmm0
        addq    $32, %rax
        vaddsd  -24(%rax), %xmm0, %xmm0
        vaddsd  -16(%rax), %xmm0, %xmm0
        vaddsd  -8(%rax), %xmm0, %xmm0
        cmpq    %rdx, %rax
        jne     .L5

and

.L5:
        movq    %rdx, %rax
        addq    $1, %rdx
        salq    $5, %rax
        addq    %rcx, %rax
        vaddsd  (%rax), %xmm0, %xmm0
        vaddsd  8(%rax), %xmm0, %xmm0
        vaddsd  16(%rax), %xmm0, %xmm0
        vaddsd  24(%rax), %xmm0, %xmm0
        cmpq    %rdx, %rsi
        jne     .L5

At -O2 there is no unrolling, but the benefit of unrolling whilst retaining
the data dependency on %xmm0 must be very slight. (I am also slightly
confused that using contiguous is not equivalent to the a(*) version.)

At -Ofast one gets

.L5:
        vaddpd  (%rax), %ymm0, %ymm0
        addq    $32, %rax
        cmpq    %rdx, %rax
        jne     .L5

and

.L5:
        movq    %rax, %rdx
        addq    $1, %rax
        salq    $5, %rdx
        vaddpd  (%rcx,%rdx), %ymm0, %ymm0
        cmpq    %rax, %rsi
        jne     .L5

In contrast, nvfortran at -O2 and above produces

.LBB0_7:
        movq    %rdx, %r8
        vaddpd  (%rsi), %ymm0, %ymm0
        vaddpd  32(%rsi), %ymm1, %ymm1
        vaddpd  64(%rsi), %ymm2, %ymm2
        vaddpd  96(%rsi), %ymm3, %ymm3
        subq    $-128, %rsi
        addq    $-16, %rdx
        addq    $-15, %r8
        cmpq    $16, %r8
        jg      .LBB0_7

I am not sure that I like the loop-control instructions, but the vaddpds
into four separate sums look good, as this breaks the data dependencies
between the longish-latency vaddpds.

If the code had been written with an explicit loop

foo=0
do i=1,n
  foo=foo+a(i)
enddo

then it would not be standard-conformant to rearrange it into partial sums
as gfortran does at -Ofast and nvfortran does at -O2.
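Written out at source level, the reassociation in question amounts to
something like the following sketch (my illustration, not compiler output;
the function name, the four-way split, and the remainder handling are all
my own choices, mirroring nvfortran's four ymm accumulators):

```fortran
! Illustrative only: a four-accumulator rewrite of sum(a(1:n)).
! A compiler may produce this from sum(), whose ordering is unspecified,
! but not (without -Ofast-style flags) from the naive serial loop.
function foo4(a,n) result(s)
  real(kind(1d0)) :: a(*), s
  real(kind(1d0)) :: s1, s2, s3, s4
  integer :: n, i, m
  s1 = 0; s2 = 0; s3 = 0; s4 = 0
  m = n - mod(n,4)
  do i = 1, m, 4          ! four independent dependency chains
    s1 = s1 + a(i)
    s2 = s2 + a(i+1)
    s3 = s3 + a(i+2)
    s4 = s4 + a(i+3)
  end do
  s = (s1+s2) + (s3+s4)
  do i = m+1, n           ! scalar remainder
    s = s + a(i)
  end do
end function foo4
```

The point is that the four adds per iteration have no dependencies on one
another, so they can overlap and hide the vaddpd latency.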
But my understanding is that the intrinsic sum function does not specify
an ordering, and hence any ordering would be standard-compliant; arguably
the use of partial sums actually increases the accuracy for general data,
and so should be preferable from an accuracy point of view as well as for
performance.

Gfortran appears not to support the newly-introduced "reduce" locality
specifier of "do concurrent".

Other simple reduction intrinsics are expanded inline in simple cases, for
instance minval.

function foo(a,n)
  real(kind(1d0))::a(*),foo
  integer::n
  foo=minval(a(1:n))
end function foo

produces

.L6:
        vmovsd  (%rax), %xmm1
        addq    $8, %rax
        vminsd  %xmm0, %xmm1, %xmm0
        cmpq    %rdx, %rax
        jne     .L6

at -O3 -mavx2, and

.L5:
        vminpd  (%rax), %ymm0, %ymm0
        addq    $32, %rax
        cmpq    %rax, %rdx
        jne     .L5

at -Ofast -mavx2. I don't see a significant difference in the handling of
NaNs between these versions. Currently

program foo
  real(kind(1d0))::a(6)=(/1,2,3,4,5,6/)
  a(4)=sqrt(a(3)-a(4))
  write(*,*)minval(a),maxval(a),a(4)
end program foo

prints "1.0 6.0 NaN", which is okay (Fortran does not define the effect of
NaNs on minval and maxval), but perhaps not quite what one would expect.
Intel's minsd and vminpd do not handle NaNs in the way one might expect
either. Even when Fortran claims IEEE support, it does not require full
support; see 17.9 of the F2023 standard. Fortran provides the ieee_min
function if one wants a fully IEEE-compliant min.
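For reference, a minimal sketch of the NaN-propagating behaviour via
ieee_min (an F2023 addition to ieee_arithmetic, corresponding to the
IEEE 754-2019 minimum operation; compiler support for it may still be
limited, and the program name is my own):

```fortran
! Sketch: ieee_min propagates a quiet NaN, unlike minval above
! (or x86 minsd/vminpd, which simply return the second operand).
program nanmin
  use ieee_arithmetic, only: ieee_min, ieee_value, ieee_quiet_nan
  real(kind(1d0)) :: x, nan
  nan = ieee_value(nan, ieee_quiet_nan)
  x = ieee_min(1d0, nan)   ! NaN, not 1.0
  write(*,*) x
end program nanmin
```

There are also the F2018 ieee_min_num/ieee_max_num, which instead return
the non-NaN operand, if ignoring NaNs is what is actually wanted.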