https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587
--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> The issue isn't unrolling but invariant motion. We unroll the innermost
> loop, vectorizer the middle loop and then unroll that as well. That leaves
> us with
> 64 invariant loads from b[] in the outer loop which I think RTL opts never
> "schedule back", even with -fsched-pressure.
>

Aside from the loads, by fully unrolling the inner loop, that means we need 16
unique registers live for the destination every iteration. That's already half
the SIMD register file on AArch64 gone, not counting the invariant loads.

> Estimating register pressure on GIMPLE is hard and we heavily rely on
> "optimistic" transforms with regard to things being optimized in followup
> passes during the GIMPLE phase.

Understood, but if we can't do it automatically, is there a way to tell the
unroller not to fully unroll this?

The #pragma GCC unroll 8 doesn't work as that seems to stop GIMPLE unrolling
and does it at RTL instead.

At the moment a way for the user to locally control the unroll amount would
already be a good step. I know there's the param, but that's global and
typically the unroll factor would depend on the GEMM kernel.
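
For reference, a minimal sketch of what the per-loop request looks like in
source. This is not the testcase from this report: the tile sizes M, N, K, the
function names and the loop nest are all hypothetical, chosen only to show
where the pragma goes. It makes no claim about which pass ends up doing the
unrolling; as noted above, that seems to move from GIMPLE to RTL once the
pragma is present.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile sizes, not taken from the bug report. */
#define M 16
#define N 16
#define K 16

/* Reference kernel: all unrolling decisions are left to the compiler. */
static void gemm_ref(const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++)
                acc += a[i * K + k] * b[k * N + j];
            c[i * N + j] = acc;
        }
}

/* Same kernel with a local unroll request on the innermost loop only.
   The pragma must sit immediately before the loop it applies to. */
static void gemm_unroll8(const float *a, const float *b, float *c)
{
    /* Assumption of this sketch: the trip count is a multiple of 8. */
    assert(K % 8 == 0);

    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
#pragma GCC unroll 8
            for (size_t k = 0; k < K; k++)
                acc += a[i * K + k] * b[k * N + j];
            c[i * N + j] = acc;
        }
}

/* Returns 1 if both kernels produce identical results, 0 otherwise.
   Inputs are small integers so every float sum is exact. */
int kernels_agree(void)
{
    static float a[M * K], b[K * N], c0[M * N], c1[M * N];

    for (size_t i = 0; i < M * K; i++)
        a[i] = (float)(i % 7);
    for (size_t i = 0; i < K * N; i++)
        b[i] = (float)(i % 5);

    gemm_ref(a, b, c0);
    gemm_unroll8(a, b, c1);

    for (size_t i = 0; i < M * N; i++)
        if (c0[i] != c1[i])
            return 0;
    return 1;
}
```

The pragma only changes how the loop is compiled, not what it computes, so
kernels_agree() is just a sanity check. To see what the unroller actually did
one would still have to compare the generated assembly of the two kernels.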