https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #35 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #34 from Wilco <wilco at gcc dot gnu.org> ---
> (In reply to rguent...@suse.de from comment #30)
> > On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:
> >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> > >
> > > --- Comment #29 from Wilco <wilco at gcc dot gnu.org> ---
> > > (In reply to Jiu Fu Guo from comment #28)
> > > > For these kinds of small loops, it would be acceptable to unroll in
> > > > GIMPLE, because register pressure and instruction cost may not be
> > > > major concerns; just like the "cunroll" and "cunrolli" passes
> > > > (complete unroll), which are also done at O2.
> > >
> > > Absolutely, unrolling is a high-level optimization like vectorization.
> >
> > To expose ILP? I'd call that low-level though ;)
> >
> > If it exposes data reuse then I'd call it high-level - and at that level
> > we already have passes like predictive commoning or unroll-and-jam doing
> > exactly that. Or vectorization.
> >
> > We've shown through data that unrolling without a good idea of CPU
> > pipeline details is a loss on x86_64. This further hints at it being
> > low-level.
>
> That was using the RTL unroller and existing settings, right? Those
> settings are not sane - is 16x unrolling still the default?!? If so,
> that doesn't surprise me at all.

We actually did measurements restricting that to 2x, 3x, 4x, ..., plus
selectively enabling some extra unroller features
(-fsplit-ivs-in-unroller, -fvariable-expansion-in-unroller). Quite a big
testing matrix, but nothing was an obvious win.

> Note also it's dangerous to assume x86 behaves exactly like Arm, POWER
> or other ISAs. Arm cores are very wide nowadays (6+ wide) and have
> supported 2 memory accesses per cycle (loads and/or stores) for years.

Sure.
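
For reference, a sketch (not taken from this report; the loop and the 4x
factor are just illustrative) of the transformation that
-fvariable-expansion-in-unroller is about: unrolling a reduction and
splitting the accumulator into independent copies so the adds no longer
form a single dependence chain - the "expose ILP" case discussed above.

/* Original reduction: one accumulator, so every add depends on the
   previous one and the loop runs at the latency of the adder.  */
float
sum (const float *a, int n)
{
  float s = 0.0f;
  for (int i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* Roughly what 4x unrolling plus variable expansion produces: four
   independent partial sums that can issue in parallel and are combined
   after the loop.  Hand-written illustration; the real unroller also
   emits prologue/epilogue code, handled here by the remainder loop.  */
float
sum_unrolled_4x (const float *a, int n)
{
  float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
  int i;
  for (i = 0; i + 3 < n; i += 4)
    {
      s0 += a[i];
      s1 += a[i + 1];
      s2 += a[i + 2];
      s3 += a[i + 3];
    }
  for (; i < n; i++)
    s0 += a[i];
  return (s0 + s1) + (s2 + s3);
}

For a floating-point reduction like this the compiler may only split the
accumulator under -ffast-math/-fassociative-math, since it reassociates
the additions; and the unroll factor in such experiments can presumably
be capped with something like --param max-unroll-times=4 on top of
-funroll-loops.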