https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #35 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #34 from Wilco <wilco at gcc dot gnu.org> ---
> (In reply to rguent...@suse.de from comment #30)
> > On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:
> >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> > >
> > > --- Comment #29 from Wilco <wilco at gcc dot gnu.org> ---
> > > (In reply to Jiu Fu Guo from comment #28)
> > > > For these kinds of small loops, it would be acceptable to unroll in
> > > > GIMPLE, because register pressure and instruction cost may not be
> > > > major concerns; just like the "cunroll" and "cunrolli" passes
> > > > (complete unroll), which are also done at O2.
> > >
> > > Absolutely, unrolling is a high-level optimization like vectorization.
> >
> > To expose ILP? I'd call that low-level though ;)
> >
> > If it exposes data reuse then I'd call it high-level - and at that level
> > we already have passes like predictive commoning or unroll-and-jam doing
> > exactly that. Or vectorization.
> >
> > We've shown through data that unrolling without a good idea of CPU
> > pipeline details is a loss on x86_64. This further hints at it being
> > low-level.
>
> That was using the RTL unroller and existing settings, right? Those
> settings are not sane - is 16x unrolling still the default?!? If so,
> that doesn't surprise me at all.

We actually did measurements restricting that to 2x, 3x, 4x, ..., plus
selectively enabling some extra unroller features
(-fsplit-ivs-in-unroller, -fvariable-expansion-in-unroller). Quite a big
testing matrix, but nothing was an obvious win.

> Note also it's dangerous to assume x86 behaves exactly like Arm, POWER
> or other ISAs. Arm cores are very wide nowadays (6+ wide) and have
> supported 2 memory accesses per cycle (loads and/or stores) for years.

Sure.
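
For reference, a sketch (not taken from this report; the loop and the 4x
factor are just illustrative) of the transformation that
-fvariable-expansion-in-unroller is about: unrolling a reduction and
splitting the accumulator into independent copies so the adds no longer
form a single dependence chain - the "expose ILP" case discussed above.

/* Original reduction: one accumulator, so every add depends on the
   previous one and the loop runs at the latency of the adder.  */
float
sum (const float *a, int n)
{
  float s = 0.0f;
  for (int i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* Roughly what 4x unrolling plus variable expansion produces: four
   independent partial sums that can issue in parallel and are combined
   after the loop.  Hand-written illustration; the real unroller also
   emits prologue/epilogue code, handled here by the remainder loop.  */
float
sum_unrolled_4x (const float *a, int n)
{
  float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
  int i;
  for (i = 0; i + 3 < n; i += 4)
    {
      s0 += a[i];
      s1 += a[i + 1];
      s2 += a[i + 2];
      s3 += a[i + 3];
    }
  for (; i < n; i++)
    s0 += a[i];
  return (s0 + s1) + (s2 + s3);
}

For a floating-point reduction like this the compiler may only split the
accumulator under -ffast-math/-fassociative-math, since it reassociates
the additions; and the unroll factor in such experiments can presumably
be capped with something like --param max-unroll-times=4 on top of
-funroll-loops.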