https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625
            Bug ID: 86625
           Summary: funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling
           Product: gcc
           Version: 8.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: elrodc at gmail dot com
  Target Milestone: ---

I wasn't sure where to put this. I posted to the gcc Fortran mailing list initially, but was redirected to bugzilla.

I specified rtl-optimization as the component because the manually unrolled version avoids register spills yet has 13 (unnecessary?) vmovapd instructions between registers, and the loop version is a behemoth of moving data in, out, and between registers. The failure of the loop might also fall under tree optimization?

For that reason, completely unrolling the loop actually results in less than a third of the assembly the loop produces. Unfortunately, -funroll-loops did not completely unroll the loop, making the manual unrolling necessary. The assembly is identical whether or not -funroll-loops is used.

Adding the directive:

    !GCC$ unroll 31

does lead to complete unrolling, but also to the use of xmm registers instead of zmm, and thus massive amounts of spilling (and probably extremely slow code; I did not benchmark it).

Here is the code (a 16x32 * 32x14 matrix multiplication kernel for avx-512 [the 32 is arbitrary]), sans directive:
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.f90

I compiled with:

    gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -funroll-loops -S -shared -fPIC kernels.f90 -o kernels.s

resulting in this assembly (without the directive):
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.s

The manually unrolled version has 13 vmovapd instructions that look unnecessary (like a vfmadd should've been able to place the answer in the correct location?).
8 of them move from one register to another, and 5 look something like:

    vmovapd %zmm20, 136(%rsp)

I suspect there should ideally be 0 of these? If not, I'd be interested in learning more about why. This at least seems like an RTL optimization bug/question.

The rest of the generated code looks great to me. It consists of repeated blocks of only:

    2x vmovupd
    7x vbroadcastsd
    14x vfmadd231pd

In the looped code, however, the `vfmadd231pd` instructions are a rare sight among all the register management. The loop code begins at line 1475 in the assembly file.

While the manually unrolled code benchmarked at 135ns, the looped version took 1.4 microseconds on my computer.

Trying to understand more about what it's doing:

- While the manually unrolled code has the expected 868 = (16/8)*(32-1)*14 vfmadds for the fully unrolled code, the looped version has two blocks of 224 = (16/8)*X*14, where X = 8, indicating it is partially unrolling the loop. One of the blocks uses xmm registers instead of zmm, so it looks like the compiler mistakenly thinks smaller vectors may be needed to clean up something? (Maybe it is trying to vectorize across loop iterations, rather than within, in some weird way? I don't know why it'd be using all those vpermt2pd otherwise.)
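
The instruction counts and the timing ratio quoted above can be double-checked with a few lines of arithmetic. This is only a sketch: `fma_count` and `LANES_PER_ZMM` are hypothetical names that do not appear in kernels.f90, Python is used purely as a calculator, and the `32 - 1` factor is taken as given from the formula above rather than derived from the assembly.

```python
# Sanity check of the vfmadd counts and timings quoted in this report.
# All names here are illustrative assumptions, not taken from kernels.f90.

LANES_PER_ZMM = 8  # a 512-bit zmm register holds 8 doubles (512 / 64)


def fma_count(rows, k_steps, cols, lanes=LANES_PER_ZMM):
    """Number of vector FMAs for `k_steps` updates of a rows x cols tile,
    with each column of the tile held in rows // lanes vector registers."""
    vectors_per_column = rows // lanes
    return vectors_per_column * k_steps * cols


# Fully unrolled kernel: (16/8) * (32-1) * 14
full_unroll = fma_count(16, 32 - 1, 14)

# One block of the looped version: (16/8) * X * 14 with X = 8
partial_block = fma_count(16, 8, 14)

# Recover X from the observed block size of 224 vfmadds
unroll_factor = 224 // ((16 // LANES_PER_ZMM) * 14)

# Slowdown of the looped version: 1.4 microseconds vs 135 ns
slowdown = 1400 / 135

print(full_unroll, partial_block, unroll_factor, round(slowdown, 1))
```

The recovered factor of 8 is consistent with the loop being partially unrolled by 8 rather than completely, and the timing ratio is consistent with the "10x slower" in the summary.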