https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99633
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-03-18

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I guess a heuristic could be to use the available load/store bandwidth (for
streaming loads/stores only) when load/store 'uops' (stmts/insns) dominate
the loop. In the case of this loop we don't even need an epilogue, so that's
a plus as well.

The inner loop could also be split at LEN_1D/2 to make the load of
a[LEN_1D/2] invariant in all but a single iteration (possibly not worth the
trouble in this case).
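
For illustration, the loop split suggested above might look roughly like the following sketch. The loop under discussion is not quoted in this message, so its shape here (a[i] = a[LEN_1D/2] + b[i]) is an assumption, as are the function names, the float element type, the value of LEN_1D, and the use of restrict to rule out aliasing between a and b. It shows a hand-written source-level version of the idea, not what GCC emits.

```c
#define LEN_1D 32000  /* assumed trip count; the real value is not given here */

/* Assumed shape of the loop: every iteration reloads a[LEN_1D/2],
   which is overwritten exactly once, when i == LEN_1D/2. */
void loop_original(float *restrict a, const float *restrict b)
{
    for (int i = 0; i < LEN_1D; i++)
        a[i] = a[LEN_1D / 2] + b[i];
}

/* Hand-split version: the load of a[LEN_1D/2] is hoisted out of both
   halves, and only the middle iteration is handled separately. */
void loop_split(float *restrict a, const float *restrict b)
{
    const int mid = LEN_1D / 2;

    float t = a[mid];               /* invariant for all i < mid */
    for (int i = 0; i < mid; i++)
        a[i] = t + b[i];

    a[mid] = t + b[mid];            /* the single iteration that changes a[mid] */

    t = a[mid];                     /* invariant again for all i > mid */
    for (int i = mid + 1; i < LEN_1D; i++)
        a[i] = t + b[i];
}
```

Both halves then consist of one streaming load, one add against a scalar, and one streaming store, which is the load/store-dominated form the heuristic above is about.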