https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116128
--- Comment #4 from anlauf at gcc dot gnu.org ---
(In reply to mjr19 from comment #3)
> It seems that most of these are in-line expanded by gfortran-14.1, at least
> in some cases.
>
> function foo(a,n)
> real(kind(1d0))::a(*),foo
> integer::n
>
> foo=sum(a(1:n))
> end function foo
>
> and
>
> function foo(a)
> real(kind(1d0)), contiguous::a(:)
> real(kind(1d0)):: foo
>
> foo=sum(a)
> end function foo
>
> are both inline expanded, and with gfortran-14.1 -O3 -mavx2 -S give
>
> .L5:
>         vaddsd  (%rax), %xmm0, %xmm0
>         addq    $32, %rax
>         vaddsd  -24(%rax), %xmm0, %xmm0
>         vaddsd  -16(%rax), %xmm0, %xmm0
>         vaddsd  -8(%rax), %xmm0, %xmm0
>         cmpq    %rdx, %rax
>         jne     .L5
>
> and
>
> .L5:
>         movq    %rdx, %rax
>         addq    $1, %rdx
>         salq    $5, %rax
>         addq    %rcx, %rax
>         vaddsd  (%rax), %xmm0, %xmm0
>         vaddsd  8(%rax), %xmm0, %xmm0
>         vaddsd  16(%rax), %xmm0, %xmm0
>         vaddsd  24(%rax), %xmm0, %xmm0
>         cmpq    %rdx, %rsi
>         jne     .L5
>
> At -O2 there is no unrolling, but the benefits of unrolling whilst retaining
> the data dependency on %xmm0 must be very slight.  (I am also slightly
> confused that using contiguous is not equivalent to the a(*) version.)

Indeed, this is an interesting observation!  Comparing the generated tree
code and assembler for the second (contiguous) version, using either

  foo = sum (a)

or

  foo = sum (a(1:size(a)))

shows slight differences which might be worth investigating.

> At -Ofast one gets
>
> .L5:
>         vaddpd  (%rax), %ymm0, %ymm0
>         addq    $32, %rax
>         cmpq    %rdx, %rax
>         jne     .L5
>
> and
>
> .L5:
>         movq    %rax, %rdx
>         addq    $1, %rax
>         salq    $5, %rdx
>         vaddpd  (%rcx,%rdx), %ymm0, %ymm0
>         cmpq    %rax, %rsi
>         jne     .L5

I played with your examples.  Indeed, the generated code looks conservative,
with improvements going from -O2 -> -O3 -> -O3 -ffast-math -> -Ofast.

The frontend generates straightforward tree code, so even allowing
reassociation in the optimizer will not make it possible to take full
advantage of vectorization similar to nvfortran, unless the middle end
allowed loops or loop nests to be annotated so that it may transform the
loop.  Of course such an optimization would only be allowable for types
for which reassociation is permitted (or with -ffast-math).

(Compilers for real vector machines do generate code that recursively
calculates partial sums; otherwise these machines would be painfully slow.)

What gcc seems to lack is support for annotating loop nests to collapse
loops - I only see mention of the explicit OMP annotation.  When looping
over the elements of a contiguous array, collapsing may remove a
significant part of the loop overhead.

> In contrast nvfortran at -O2 and above produces
>
> .LBB0_7:
>         movq    %rdx, %r8
>         vaddpd  (%rsi), %ymm0, %ymm0
>         vaddpd  32(%rsi), %ymm1, %ymm1
>         vaddpd  64(%rsi), %ymm2, %ymm2
>         vaddpd  96(%rsi), %ymm3, %ymm3
>         subq    $-128, %rsi
>         addq    $-16, %rdx
>         addq    $-15, %r8
>         cmpq    $16, %r8
>         jg      .LBB0_7
>
> I am not sure that I like the loop control instructions, but the vaddpd's to
> four separate sums look good as this breaks the data dependencies between
> the longish-latency vaddpd's.

One could emulate this by switching to a fast library version of sum for
sufficiently large array arguments, with a threshold to be determined, like
we do e.g. for using dgemm from blas instead of matmul.  (A sketch of such
a kernel follows below.)

> If the code had been written with an explicit loop
>
> foo=0
> do i=1,n
> foo=foo+a(i)
> enddo
>
> then it would not be standard-conformant to rearrange into partial sums as
> gfortran does at -Ofast, and nvfortran does at -O2.

Right.
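
To illustrate the transformation, here is a minimal sketch of a
partial-sums kernel such a library version of sum might use (fast_sum is a
hypothetical name; the reassociation is only valid for types where it is
permitted, or under -ffast-math):

  function fast_sum (a, n) result (s)
    implicit none
    integer, intent(in) :: n
    real(kind(1d0)), intent(in) :: a(n)
    real(kind(1d0)) :: s, s1, s2, s3, s4
    integer :: i, m
    ! Four independent accumulators break the data dependency between
    ! the long-latency adds, mirroring nvfortran's four vaddpd streams.
    s1 = 0; s2 = 0; s3 = 0; s4 = 0
    m = n - mod (n, 4)
    do i = 1, m, 4
       s1 = s1 + a(i)
       s2 = s2 + a(i+1)
       s3 = s3 + a(i+2)
       s4 = s4 + a(i+3)
    end do
    s = (s1 + s2) + (s3 + s4)
    do i = m + 1, n        ! remainder loop for n not divisible by 4
       s = s + a(i)
    end do
  end function fast_sum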
> But my understanding is
> that the intrinsic sum function does not specify the ordering, and hence any
> ordering would be standard-compliant,

Agree here.

> and, arguably, the use of partial sums
> actually increases the accuracy for general data, and so should be
> preferable from an accuracy point of view, as well as for performance.

I do not see this.  Why should it increase accuracy for *general* data?
It's just faster but numerically different, isn't it?

> Gfortran appears not to support the newly-introduced "reduce" qualifier of
> "do concurrent".

No, not yet.  But there are open PRs on this, and I think there was a
project; I do not know about its status, though.

However, how is "do concurrent" going to help here?  Parallel reduction
operations do have their own overhead and normally try to generate the same
result independent of the number of threads, which looks orthogonal to the
transformation into SIMD operations with partial sums.

> Other simple reduction intrinsics are expanded inline in simple cases, for
> instance minval.
>
> function foo(a,n)
> real(kind(1d0))::a(*),foo
> integer::n
>
> foo=minval(a(1:n))
> end function foo
>
> produces
>
> .L6:
>         vmovsd  (%rax), %xmm1
>         addq    $8, %rax
>         vminsd  %xmm0, %xmm1, %xmm0
>         cmpq    %rdx, %rax
>         jne     .L6
>
> at -O3 -mavx2, and
>
> .L5:
>         vminpd  (%rax), %ymm0, %ymm0
>         addq    $32, %rax
>         cmpq    %rax, %rdx
>         jne     .L5
>
> at -Ofast -mavx2.  I don't see a significant difference in the handling of
> NaNs between these versions?  Currently
>
> program foo
> real(kind(1d0))::a(6)=(/1,2,3,4,5,6/)
> a(4)=sqrt(a(3)-a(4))
> write(*,*)minval(a),maxval(a),a(4)
> end program foo
>
> prints "1.0 6.0 NaN" which is okay (Fortran does not define the effect of
> NaNs on minval and maxval), but perhaps not quite what one would expect.
> Intel's minsd and vminpd do not handle NaNs in the way one might expect
> either.  Even when Fortran claims IEEE support, it does not require full
> support, see 17.9 of the F2023 Standard.  Fortran provides the ieee_min
> function if one wants a fully IEEE-compliant min.

Intel also has issues with minloc.  Gfortran can inline this intrinsic for
rank-1 (gcc <= 14), and there is work on inlining it more generally.
(Hint: try minloc([nan, inf]) with different compilers; see the test
program below.)

minval/maxval are interesting by themselves, as some encoders run both on
the same dataset.  I know of one compiler which does both at the same time
if it detects a certain pattern.
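
To make the hint concrete, here is a minimal self-contained test (the
program and variable names are only illustrative):

  program minloc_nan
    use ieee_arithmetic, only: ieee_value, ieee_quiet_nan, ieee_positive_inf
    implicit none
    real(kind(1d0)) :: v(2)
    v(1) = ieee_value (v(1), ieee_quiet_nan)     ! v = [NaN, Inf]
    v(2) = ieee_value (v(2), ieee_positive_inf)
    ! The result is processor-dependent: compilers may disagree on
    ! whether the NaN or the Inf is reported as the minimum's location.
    print *, minloc (v)
  end program minloc_nan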