https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #4)
> (In reply to Richard Biener from comment #3)
> > The issue isn't unrolling but invariant motion.  We unroll the innermost
> > loop, vectorize the middle loop and then unroll that as well.  That leaves
> > us with
> > 64 invariant loads from b[] in the outer loop which I think RTL opts never
> > "schedule back", even with -fsched-pressure.
> > 
> 
> Aside from the loads, by fully unrolling the inner loop, that means we need
> 16 unique registers live for the destination every iteration.  That's
> already half the SIMD register file on AArch64 gone, not counting the
> invariant loads.

Why?  You can try -fno-tree-pre -fno-tree-loop-im -fno-predictive-commoning

> > Estimating register pressure on GIMPLE is hard and we heavily rely on
> > "optimistic" transforms with regard to things being optimized in followup
> > passes during the GIMPLE phase.
> 
> Understood, but if we can't do it automatically, is there a way to tell the
> unroller not to fully unroll this?

Like you did ...

> The #pragma GCC unroll 8 doesn't work as that seems to stop GIMPLE unrolling
> and does it at RTL instead.

... because on GIMPLE we can only fully unroll or not unroll at all.

> At the moment a way for the user to locally control the unroll amount would
> already be a good step. I know there's the param, but that's global and
> typically the unroll factor would depend on the GEMM kernel.

As said, it should already work, to the extent that on GIMPLE we do not
perform classical (partial) loop unrolling.
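
For illustration, a minimal sketch of the per-loop control discussed above.
This is not the testcase from the PR: the kernel, the name tile_kernel and
the tile size N are hypothetical.  It only shows where the pragma goes.  Per
the discussion above, the pragma is expected to keep the loop from being
completely unrolled on GIMPLE and leave the requested factor to the RTL
unroller.

```c
#include <stddef.h>

/* Hypothetical tile size; the real kernel in the PR may differ.  */
#define N 16

/* Assumed GEMM-like kernel: c += a * b on row-major N x N tiles.  */
void
tile_kernel (float *c, const float *a, const float *b)
{
  for (size_t i = 0; i < N; i++)
    for (size_t j = 0; j < N; j++)
      {
        float acc = c[i * N + j];
        /* Request an unroll factor of 8 for this loop only, instead of
           relying on a global param.  */
#pragma GCC unroll 8
        for (size_t k = 0; k < N; k++)
          acc += a[i * N + k] * b[k * N + j];
        c[i * N + j] = acc;
      }
}
```

To see how much of the pressure comes from invariant motion and related
passes, the same file can be built at -O3 with and without
-fno-tree-pre -fno-tree-loop-im -fno-predictive-commoning and the generated
assembly compared.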
