https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #34 from Wilco &lt;wilco at gcc dot gnu.org&gt; ---
(In reply to rguent...@suse.de from comment #30)
> On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> >
> > --- Comment #29 from Wilco &lt;wilco at gcc dot gnu.org&gt; ---
> > (In reply to Jiu Fu Guo from comment #28)
> > > For these kinds of small loops, it would be acceptable to unroll in
> > > GIMPLE, because register pressure and instruction cost may not be
> > > major concerns; just like the "cunroll" and "cunrolli" passes
> > > (complete unroll), which are also done at O2.
> >
> > Absolutely, unrolling is a high-level optimization like vectorization.
>
> To expose ILP? I'd call that low-level though ;)
>
> If it exposes data reuse then I'd call it high-level - and at that level
> we already have passes like predictive commoning or unroll-and-jam doing
> exactly that. Or vectorization.
>
> We've shown through data that unrolling without a good idea of CPU
> pipeline details is a loss on x86_64. This further hints at it
> being low-level.

That was using the RTL unroller and existing settings, right? Those settings
are not sane - is 16x unrolling still the default?!? If so, that doesn't
surprise me at all.

Note also it's dangerous to assume x86 behaves exactly like Arm, POWER or
other ISAs. Arm cores are very wide nowadays (6+ wide) and have supported
2 memory accesses per cycle (loads and/or stores) for years.