https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that on Haswell the conditional moves are two uops while on Broadwell and up they are only one uop (overall the loop is 16 uops vs. 18 uops on Haswell). IACA doesn't show any particular issue (the iterations should neatly interleave w/o inter-iteration dependences), but it says the throughput bottleneck is dependency chains (not sure if it models conditional moves very well).
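
For readers unfamiliar with the pattern, here is a rough sketch of the kind of loop shape being discussed. It is a hypothetical illustration, not the loop from this PR: the function name, parameters and the choice of two selects per iteration are assumptions. Each iteration performs two selects that a compiler may turn into conditional moves on x86-64 (it may also choose branches or vectorize, depending on flags and target), and no iteration consumes a value produced by a previous one, so the only dependency chains are the short ones within an iteration.

```c
#include <stddef.h>

/* Hypothetical example; names are illustrative and not taken from the PR.
   dst[i] = max(max(a[i], b[i]), floor_val) for each i.  */
void max_with_floor(int *restrict dst, const int *restrict a,
                    const int *restrict b, int floor_val, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* First select: candidate for a conditional move.  */
        int t = a[i] > b[i] ? a[i] : b[i];
        /* Second select depends on the first, but only within this
           iteration; nothing is carried over to iteration i + 1.  */
        dst[i] = t > floor_val ? t : floor_val;
    }
}
```

Whether the selects actually end up as cmov instructions can be checked by inspecting the generated assembly (for example with `gcc -O2 -S`).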