https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #34 from Wilco &lt;wilco at gcc dot gnu.org&gt; ---
(In reply to rguent...@suse.de from comment #30)
> On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> >
> > --- Comment #29 from Wilco &lt;wilco at gcc dot gnu.org&gt; ---
> > (In reply to Jiu Fu Guo from comment #28)
> > > For these kinds of small loops, it would be acceptable to unroll in
> > > GIMPLE, because register pressure and instruction cost may not be
> > > major concerns; just like the "cunroll" and "cunrolli" passes
> > > (complete unroll), which are also done at O2.
> >
> > Absolutely, unrolling is a high-level optimization like vectorization.
>
> To expose ILP? I'd call that low-level though ;)
>
> If it exposes data reuse then I'd call it high-level - and at that level
> we already have passes like predictive commoning or unroll-and-jam doing
> exactly that. Or vectorization.
>
> We've shown through data that unrolling without a good idea of CPU
> pipeline details is a loss on x86_64. This further hints at it
> being low-level.

That was using the RTL unroller and existing settings, right? Those settings
are not sane - is 16x unrolling still the default?!? If so, that doesn't
surprise me at all.

Note also it's dangerous to assume x86 behaves exactly like Arm, POWER or
other ISAs. Arm cores are very wide nowadays (6+ wide) and have supported
2 memory accesses per cycle (loads and/or stores) for years.