https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534
--- Comment #7 from James Greenhalgh <jgreenhalgh at gcc dot gnu.org> ---
I'm not sure there are any bugs here to fix, though I can still reproduce the performance differences.

First up, basic block reordering causes an issue on every microarchitecture I've looked at. Basic block reordering is kicking in because the static estimates of the execution profile make it look like a good idea. I'm struggling to understand exactly what the execution profile of the testcase is intended to be, as I'm finding both the source and the generated assembly/perf reports hard to follow. Because of that, I can't tell whether the basic block reorganisation is sensible, but it doesn't look buggy.

Turning basic block reordering off (with -fno-reorder-blocks) removes the performance difference for me. With it off, builds from before and after r245151 perform similarly on Cortex-A53 and Cortex-A72.

However, Cortex-A57 still shows a performance regression, which I believe is related to an extra conditional branch in the code generated after r245151. I tried to find which pass previously removed this branch and narrowed it down to jump2, but I haven't figured out why jump2 behaves differently after the change.

I'm on vacation now, so I won't be able to look at this in the next week, in case anyone else wants to dig.
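
For anyone wanting to see the effect of static estimates on block layout in isolation, here is a minimal sketch. It is not the testcase from this report: the function name sum_clamped and the UNLIKELY macro are hypothetical and exist only for illustration. The sketch uses __builtin_expect to tell GCC that one arm of a branch is rarely taken, which feeds the same kind of static profile estimate that block reordering relies on.

```c
#include <stddef.h>

/* Hypothetical helper macro: marks a condition as rarely true, which
   lowers the estimated probability of the guarded block. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical example function, not taken from the bug's testcase.
   Sums the first n elements of v, clamping each element to limit. */
long sum_clamped(const int *v, size_t n, int limit)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(v[i] > limit)) {
            /* Estimated cold path: the compiler may lay this block
               out away from the hot loop body. */
            total += limit;
        } else {
            /* Estimated hot path. */
            total += v[i];
        }
    }
    return total;
}
```

Compiling this with -O2 -S, once as-is and once with -fno-reorder-blocks added, and comparing the two assembly files should show whether the layout changes. For a branch this small the compiler may emit a conditional select instead of a branch, in which case there is nothing left to reorder and a heavier cold path would be needed to see a difference.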