[Bug rtl-optimization/86625] New: funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling

elrodc at gmail dot com Sat, 21 Jul 2018 19:44:22 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625


            Bug ID: 86625
           Summary: funroll-loops doesn't unroll, producing >3x assembly
                    and running 10x slower than manual complete unrolling
           Product: gcc
           Version: 8.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: elrodc at gmail dot com
  Target Milestone: ---

I wasn't sure where to put this.
I initially posted to the gcc Fortran mailing list, but was redirected to
bugzilla.
I specified RTL-optimization as the component because the manually unrolled
version avoids register spills yet has 13 (unnecessary?) vmovapd instructions,
and the loop version is a behemoth of moving data into, out of,
and between registers.

The failure to unroll the loop might instead fall under tree optimization?

Because of all that register traffic, the loop version is over 3x the size of
the completely unrolled one in assembly. Unfortunately, -funroll-loops did not
completely unroll the loop, which made the manual unrolling necessary.
The assembly is identical whether or not -funroll-loops is used.
Adding the directive:
   !GCC$ unroll 31
does lead to complete unrolling, but also to the use of xmm registers instead
of zmm, and thus to massive amounts of spilling (and probably extremely slow
code; I did not benchmark it).

Here is the code (a 16x32 * 32x14 matrix multiplication kernel for avx-512 [the
32 is arbitrary]), sans directive:
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.f90
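
For readers who do not want to open the source file, the arithmetic can be
modeled roughly in Python. This is only a sketch under stated assumptions, not
a transcription of kernels.f90: the name matmul_kernel, the list-of-lists
layout, and the loop order are hypothetical. It uses the 16x32 * 32x14 shape
given above and assumes the result is initialized from the first inner index,
so the remaining 31 steps are multiply-adds.

```python
M, K, N = 16, 32, 14  # A is M x K, B is K x N, C is M x N (sizes from the report)


def matmul_kernel(A, B):
    """Hypothetical model of the kernel: C = A * B built up one inner index at a time."""
    # Assumption: the k = 0 step initializes C with plain multiplies.
    C = [[A[i][0] * B[0][j] for j in range(N)] for i in range(M)]
    # The remaining K - 1 steps are multiply-adds. This is the loop that
    # one would like to see completely unrolled.
    for k in range(1, K):
        for j in range(N):
            b = B[k][j]          # one scalar of B, reused for a whole column of A
            for i in range(M):   # 16 doubles, i.e. two 8-wide vectors
                C[i][j] += A[i][k] * b
    return C
```

In a vectorized version the innermost loop would correspond to two 512-bit
multiply-adds, and the scalar b to a broadcast.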

I compiled with:
gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -funroll-loops
-S -shared -fPIC kernels.f90 -o kernels.s

resulting in this assembly (without the directive):
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.s



The manually unrolled version has 13 vmovapd instructions that look unnecessary
(like a vfmadd should've been able to place the answer in the correct
location?). 8 of them move from one register to another, and 5 look something
like:
vmovapd    %zmm20, 136(%rsp)


I suspect there should ideally be 0 of these?
If not, I'd be interested in learning more about why.
This at least seems like an RTL optimization bug/question.

The rest of the generated code looks great to me. Repeated blocks of only:
2x vmovupd
7x vbroadcastsd
14x vfmadd231pd



In the looped code, however, the `vfmadd231pd` instructions are a rare sight
between all the register management. The loop code begins at line 1475 in the
assembly file.
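
One way to put numbers on "a rare sight" is to count mnemonics in the
generated assembly. The following is a minimal sketch, assuming AT&T-syntax
output as produced by -S; the function name count_mnemonics and the sample
fragment are made up for illustration and are not taken from kernels.s.

```python
from collections import Counter


def count_mnemonics(asm_text):
    """Count instruction mnemonics in AT&T-syntax x86 assembly text."""
    counts = Counter()
    for line in asm_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith(".") or line.endswith(":"):
            continue  # skip blank lines, assembler directives, and labels
        counts[line.split()[0]] += 1
    return counts


# Made-up fragment for illustration only; it is not taken from kernels.s.
SAMPLE = """\
.L5:
    vmovupd (%rdi), %zmm0
    vbroadcastsd (%rsi), %zmm1
    vfmadd231pd %zmm0, %zmm1, %zmm2
    vfmadd231pd %zmm0, %zmm1, %zmm3
    vmovapd %zmm20, 136(%rsp)
    .p2align 4
"""

counts = count_mnemonics(SAMPLE)
print(counts["vfmadd231pd"], counts["vmovapd"])
```

To apply it to the real output, pass the contents of the assembly file, for
example count_mnemonics(open("kernels.s").read()), and compare the
vfmadd231pd count against the vmovapd and vpermt2pd counts.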

While the manually unrolled code benchmarked at 135ns, the looped version took
1.4 microseconds on my computer.

Trying to understand more about what it's doing:
- While the manually unrolled code has the expected 868 = (16/8)*(32-1)*14
vfmadds, the looped version has two blocks of 224 =
(16/8)*X*14, where X = 8, indicating that it is partially unrolling the loop.
One of the blocks uses xmm registers instead of zmm, so it looks like the
compiler mistakenly thinks smaller vectors may be needed to clean up something?

(Maybe it is trying to vectorize across loop iterations, rather than within, in
some weird way? I don't know why it'd be using all those vpermt2pd, otherwise.)
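
The instruction counts quoted above can be checked with a few lines of Python.
This is just the arithmetic from the report written out; the helper name
fma_count is hypothetical, and reading (32-1) as "the first inner step is a
plain multiply" is an assumption.

```python
LANES = 8  # doubles per 512-bit zmm register


def fma_count(rows, inner_steps, cols, lanes=LANES):
    """Vector FMAs needed for inner_steps updates of a rows x cols result tile."""
    vectors_per_column = rows // lanes  # 16 / 8 = 2 zmm registers per column
    return vectors_per_column * inner_steps * cols


# Fully unrolled: 32 inner steps, the first assumed to be a plain multiply.
full = fma_count(16, 32 - 1, 14)
# One block of the looped version, assuming an unroll factor of X = 8.
block = fma_count(16, 8, 14)
print(full, block)
```

The two values match the 868 and 224 given in the report.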
