https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117438
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> this may cause significant performance regression of some nested loops.

I suspect it depends on the microarchitecture of the x86 target. What are you running the test on?

        .p2align 6
.L3:

I notice GCC aligns only the inner loop to a 64-byte boundary, while clang/LLVM aligns each loop (inner and outer) to a 16-byte boundary.
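
For reference, `.p2align 6` requests alignment to 2^6 = 64 bytes, and a 16-byte boundary corresponds to `.p2align 4`. The padding arithmetic can be sketched in C as below. This is only an illustration: `padding_for` is a hypothetical helper name, not part of GCC or binutils, and it ignores the optional max-skip argument that `.p2align` accepts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a GCC or binutils API): number of padding
   bytes needed to advance `addr` to the next 2^log2_align boundary,
   which is what `.p2align log2_align` asks the assembler to do.
   Returns 0 when `addr` is already aligned. */
uint64_t padding_for(uint64_t addr, unsigned log2_align)
{
    assert(log2_align < 64);                  /* keep the shift well defined */
    uint64_t align = (uint64_t)1 << log2_align;
    uint64_t mask = align - 1;
    return (align - (addr & mask)) & mask;
}
```

For an assumed loop-header address of 0x1005, `padding_for(0x1005, 6)` gives 59 bytes and `padding_for(0x1005, 4)` gives 11 bytes, so 64-byte alignment can insert considerably more padding in front of a loop than 16-byte alignment.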