https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #4)
> (In reply to Richard Biener from comment #3)
> > The issue isn't unrolling but invariant motion.  We unroll the innermost
> > loop, vectorize the middle loop and then unroll that as well.  That leaves
> > us with 64 invariant loads from b[] in the outer loop which I think RTL
> > opts never "schedule back", even with -fsched-pressure.
> >
>
> Aside from the loads, by fully unrolling the inner loop, that means we need
> 16 unique registers live for the destination every iteration.  That's
> already half the SIMD register file on AArch64 gone, not counting the
> invariant loads.

Why?  You can try -fno-tree-pre -fno-tree-loop-im -fno-predictive-commoning

> > Estimating register pressure on GIMPLE is hard and we heavily rely on
> > "optimistic" transforms with regard to things being optimized in followup
> > passes during the GIMPLE phase.
>
> Understood, but if we can't do it automatically, is there a way to tell the
> unroller not to fully unroll this?

Like you did ...

> The #pragma GCC unroll 8 doesn't work as that seems to stop GIMPLE unrolling
> and does it at RTL instead.

... because on GIMPLE we can only fully unroll or not.

> At the moment a way for the user to locally control the unroll amount would
> already be a good step.  I know there's the param, but that's global and
> typically the unroll factor would depend on the GEMM kernel.

As said, it should already work to the extent that on GIMPLE we do not perform
classical loop unrolling.
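
For readers following along, here is a minimal sketch of where such a pragma goes. It is not the testcase from this PR: the function name `kernel`, the size `N` and the loop shape are made up for illustration. Whether the requested factor is applied on GIMPLE or only at RTL depends on the GCC version and target, so check the generated assembly rather than assume it.

```c
/* Hypothetical GEMM-like kernel, NOT the testcase from this PR.
   The name `kernel` and the size N are assumptions for illustration.  */
#define N 16

/* Computes c[i][j] += sum over k of a[i][k] * b[k][j].  */
void
kernel (float c[N][N], const float a[N][N], const float b[N][N])
{
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      {
        float acc = c[i][j];
        /* Ask for an unroll factor of 8 on the innermost loop only.
           The pragma must directly precede the loop it applies to.  */
#pragma GCC unroll 8
        for (int k = 0; k < N; k++)
          acc += a[i][k] * b[k][j];
        c[i][j] = acc;
      }
}
```

To see the effect of the passes mentioned above, one could compile this once with plain -O3 and once with -O3 -fno-tree-pre -fno-tree-loop-im -fno-predictive-commoning, then compare the number of loads hoisted out of the loops in the two assembly outputs.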