https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117438

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
>this may cause significant performance regression of some nested loops.

I suspect it depends on the micro-arch for the x86 target.

What are you running the test on?

        .p2align 6
.L3:

I notice GCC aligns only the inner loop to a 64-byte boundary, while clang/LLVM
aligns each loop (both inner and outer) to a 16-byte boundary.
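
For anyone wanting to compare the two alignment choices locally, here is a
minimal sketch of a nested loop to experiment with. The function name
sum_nested and the file name nested.c are hypothetical; this is not the
reporter's testcase, only an illustration of an inner/outer loop pair.

```c
#include <stddef.h>

/* Hypothetical kernel, not the testcase from this bug report.
   The head of the inner loop is where a .p2align directive
   would usually show up in the generated assembly. */
long sum_nested(const int *a, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t i = 0; i < rows; i++)        /* outer loop */
        for (size_t j = 0; j < cols; j++)    /* inner loop */
            total += a[i * cols + j];
    return total;
}
```

Compiling with something like "gcc -O2 -S -falign-loops=16 nested.c" and
checking the .p2align directives in nested.s should show whether the
requested alignment was applied. Whether that changes the timing is likely
to depend on the micro-arch, as noted above.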
