https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #11 from ktkachov at gcc dot gnu.org ---

Thank you all for the input.

Just to add a bit of data: I've instrumented 510.parest_r to count the number of loop iterations, to get a feel for how much of the time in the unrolled loop is spent in the actual unrolled part rather than in the prologue/peeled part.

Overall, the hot function itself is entered 290M times. The distribution of loop iteration counts is:

Frequency  Iterations
 92438870          36
 87028560          54
 20404571          24
 17312960          62
 14237184          72
 13403904         108
  7574437         102
  7574420          70
  5564881          40
  4328249          64
  4328240          46
  3142656          48
  2666496         124
  1248176           8
  1236641          16
  1166592         204
  1166592         140
  1134392           4
   857088          80
   666624          92
   666624         128
   618320          30
   613056           1
   234464           2
   190464          32
    95232          60
    84476          20
    48272          10
     6896           5

So the two most common iteration counts are 36 and 54. For an 8x unrolled loop that's 4 and 6 iterations spent in the peeled prologue (36 mod 8 and 54 mod 8), with 4 and 6 trips around the 8x unrolled loop body respectively.

As an experiment I hacked the AArch64 assembly of the function generated with -funroll-loops to replace the peeled prologue version with a simple non-unrolled loop. That gave a sizeable speedup on two AArch64 platforms: >7%.

So beyond the vectorisation point Richard S. made above, maybe it's worth considering replacing the peeled prologue with a simple loop instead? Or at least adding that as a distinct unrolling strategy, and working to come up with an analysis that would allow us to choose one over the other?
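
For anyone who wants to redo the 36/54 arithmetic above across the whole distribution, here is a rough Python sketch. It is not part of the original instrumentation. It assumes a simplified model in which a loop with n iterations runs n mod 8 iterations in the peeled part and n div 8 trips around the unrolled body. The names HISTOGRAM, split and peeled_share are made up for illustration.

```python
# Histogram copied from the comment: (frequency, loop iteration count).
HISTOGRAM = [
    (92438870, 36), (87028560, 54), (20404571, 24), (17312960, 62),
    (14237184, 72), (13403904, 108), (7574437, 102), (7574420, 70),
    (5564881, 40), (4328249, 64), (4328240, 46), (3142656, 48),
    (2666496, 124), (1248176, 8), (1236641, 16), (1166592, 204),
    (1166592, 140), (1134392, 4), (857088, 80), (666624, 92),
    (666624, 128), (618320, 30), (613056, 1), (234464, 2),
    (190464, 32), (95232, 60), (84476, 20), (48272, 10), (6896, 5),
]

UNROLL_FACTOR = 8  # the 8x unrolling discussed in the comment


def split(iterations, factor=UNROLL_FACTOR):
    """Return (peeled_iterations, unrolled_trips).

    Assumption: the peeled part runs exactly `iterations % factor`
    iterations and the unrolled body runs `iterations // factor` times.
    """
    return iterations % factor, iterations // factor


def peeled_share(histogram, factor=UNROLL_FACTOR):
    """Fraction of all original loop iterations executed in the peeled part."""
    total = sum(freq * n for freq, n in histogram)
    peeled = sum(freq * (n % factor) for freq, n in histogram)
    return peeled / total


if __name__ == "__main__":
    print("function entries:", sum(freq for freq, _ in HISTOGRAM))
    for n in (36, 54):
        peeled, trips = split(n)
        print(f"{n} iterations -> {peeled} peeled, {trips} unrolled trips")
    print(f"peeled share of iterations: {peeled_share(HISTOGRAM):.1%}")
```

This only counts iterations. It says nothing about the cost of the branches in the peeled code, which is presumably where the measured difference comes from.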