https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116128

--- Comment #3 from mjr19 at cam dot ac.uk ---
It seems that most of these are expanded inline by gfortran-14.1, at least in
some cases.

function foo(a,n)
  real(kind(1d0))::a(*),foo
  integer::n 

  foo=sum(a(1:n))
end function foo

and

function foo(a)
  real(kind(1d0)), contiguous::a(:)
  real(kind(1d0)):: foo

  foo=sum(a)
end function foo

are both expanded inline, and with gfortran-14.1 -O3 -mavx2 -S give, respectively,

.L5:
        vaddsd  (%rax), %xmm0, %xmm0
        addq    $32, %rax
        vaddsd  -24(%rax), %xmm0, %xmm0
        vaddsd  -16(%rax), %xmm0, %xmm0
        vaddsd  -8(%rax), %xmm0, %xmm0
        cmpq    %rdx, %rax
        jne     .L5

and

.L5:
        movq    %rdx, %rax
        addq    $1, %rdx
        salq    $5, %rax
        addq    %rcx, %rax
        vaddsd  (%rax), %xmm0, %xmm0
        vaddsd  8(%rax), %xmm0, %xmm0
        vaddsd  16(%rax), %xmm0, %xmm0
        vaddsd  24(%rax), %xmm0, %xmm0
        cmpq    %rdx, %rsi
        jne     .L5

At -O2 there is no unrolling, but the benefit of unrolling whilst retaining
the serial data dependency on %xmm0 must be very slight. (I am also slightly
confused that the contiguous version is not compiled identically to the a(*)
version.)

At -Ofast one gets

.L5:
        vaddpd  (%rax), %ymm0, %ymm0
        addq    $32, %rax
        cmpq    %rdx, %rax
        jne     .L5

and

.L5:
        movq    %rax, %rdx
        addq    $1, %rax
        salq    $5, %rdx
        vaddpd  (%rcx,%rdx), %ymm0, %ymm0
        cmpq    %rax, %rsi
        jne     .L5

In contrast nvfortran at -O2 and above produces

.LBB0_7:
        movq    %rdx, %r8
        vaddpd  (%rsi), %ymm0, %ymm0
        vaddpd  32(%rsi), %ymm1, %ymm1
        vaddpd  64(%rsi), %ymm2, %ymm2
        vaddpd  96(%rsi), %ymm3, %ymm3
        subq    $-128, %rsi
        addq    $-16, %rdx
        addq    $-15, %r8
        cmpq    $16, %r8
        jg      .LBB0_7

I am not sure that I like the loop control instructions, but the vaddpd's into
four separate accumulators look good, as this breaks the data dependency chain
between the longish-latency vaddpd instructions.

If the code had been written with an explicit loop

foo=0
do i=1,n
  foo=foo+a(i)
enddo

then it would not be standard-conformant to rearrange it into partial sums, as
gfortran does at -Ofast and nvfortran does at -O2. But my understanding is
that the sum intrinsic does not specify an ordering, so any ordering is
standard-conformant. Arguably, the use of partial sums actually increases the
accuracy for general data, and so should be preferable from an accuracy point
of view as well as for performance.

Gfortran appears not to support the "reduce" locality specifier newly
introduced for "do concurrent" in Fortran 2023.
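For reference, the F2023 syntax in question would look like this (a sketch
only; current gfortran rejects the reduce specifier, though some other
compilers accept it):

```fortran
function foo(a, n) result(s)
  ! F2023: the reduce locality specifier tells the compiler that the
  ! per-iteration contributions to s may be combined in any order,
  ! which licenses exactly the partial-sum vectorisation discussed above.
  real(kind(1d0)) :: a(*), s
  integer :: n, i
  s = 0
  do concurrent (i = 1:n) reduce(+:s)
    s = s + a(i)
  end do
end function foo
```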

Other simple reduction intrinsics are expanded inline in simple cases, for
instance minval.

function foo(a,n)
  real(kind(1d0))::a(*),foo
  integer::n

  foo=minval(a(1:n))
end function foo

produces

.L6:
        vmovsd  (%rax), %xmm1
        addq    $8, %rax
        vminsd  %xmm0, %xmm1, %xmm0
        cmpq    %rdx, %rax
        jne     .L6

at -O3 -mavx2, and

.L5:
        vminpd  (%rax), %ymm0, %ymm0
        addq    $32, %rax
        cmpq    %rax, %rdx
        jne     .L5

at -Ofast -mavx2. I don't see a significant difference in the handling of NaNs
between these versions. Currently

program foo
  real(kind(1d0))::a(6)=(/1,2,3,4,5,6/)
  a(4)=sqrt(a(3)-a(4))
  write(*,*)minval(a),maxval(a),a(4)
end program foo

prints "1.0 6.0 NaN", which is okay (Fortran does not define the effect of
NaNs on minval and maxval), but perhaps not quite what one would expect.
Intel's minsd and vminpd do not handle NaNs in the way one might expect
either. Even when a Fortran processor claims IEEE support, full support is not
required; see 17.9 of the F2023 standard. Fortran provides the ieee_min
function if one wants a fully IEEE-compliant min.
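A minval-like reduction with fully defined NaN behaviour can be written with
the ieee_arithmetic module (a sketch; ieee_min_num is the F2018 function that
treats a quiet NaN operand as missing data, and compiler support for it
varies):

```fortran
function nan_min(a, n) result(m)
  ! Reduction with defined NaN semantics: ieee_min_num(x, y) returns
  ! the numeric operand when exactly one of x, y is a quiet NaN, so
  ! NaN elements are effectively ignored.
  use, intrinsic :: ieee_arithmetic, only: ieee_min_num
  real(kind(1d0)) :: a(*), m
  integer :: n, i
  m = a(1)
  do i = 2, n
    m = ieee_min_num(m, a(i))
  end do
end function nan_min
```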
