https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84490
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 43896
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43896&action=edit
r254011 with peeling disabled

The other differences look like RA/scheduling; in the end the stack frame in the new rev. is 32 bytes larger (up from $4800 to $4832). Disabling the 2nd scheduling pass doesn't have any nice effects btw.

All the spills in the code certainly make for bad code, so I'm not sure that trying to fix things by somehow re-introducing the peeling for alignment makes the most sense... Looking for an opportunity to distribute the loop might make more sense, eventually more explicitly "spilling" shared intermediate results to memory during distribution. The source is quite unwieldy and the dependences are not obvious here.
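
For readers unfamiliar with the idea, here is a minimal sketch of loop distribution with a shared intermediate result written to memory. It is not taken from the benchmark source in this PR. The function names, the `tmp` buffer, and the `MAX_N` bound are hypothetical and chosen only for illustration, and the arrays are assumed not to alias.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound on the trip count, for illustration only. */
enum { MAX_N = 1024 };

/* Fused form: the shared value t feeds both statements and is live
   across the whole loop body. In a much larger body, many such values
   compete for registers. Assumes a, b, x and y do not alias. */
void fused(size_t n, const double *a, const double *b, double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double t = a[i] * b[i] + a[i];
        x[i] = t * t;
        y[i] = t - b[i];
    }
}

/* Distributed form: the shared value is explicitly "spilled" to tmp[],
   and each consumer gets its own, smaller loop that reloads it. */
void distributed(size_t n, const double *a, const double *b, double *x, double *y)
{
    static double tmp[MAX_N];

    assert(n <= MAX_N);
    for (size_t i = 0; i < n; i++)
        tmp[i] = a[i] * b[i] + a[i];
    for (size_t i = 0; i < n; i++)
        x[i] = tmp[i] * tmp[i];
    for (size_t i = 0; i < n; i++)
        y[i] = tmp[i] - b[i];
}
```

Both functions compute the same `x` and `y`. The distributed form trades extra memory traffic through `tmp` for fewer values live at once in each loop. Whether that trade pays off for the loop in this PR is not something the sketch can show.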