https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Even -fno-ivopts produces better code:
.L3:
        add     x0, x5, x1, sxtw
        add     w1, w1, 1
        ldr     d2, [x3], 124
        ldr     w0, [x4, x0, lsl 2]
        dup     v0.2s, w0
        mla     v1.2s, v0.2s, v2.2s
        subs    w2, w2, #1
        bne     .L3

Compared with:
.L3:
        lsl     x2, x0, 5
        add     x1, x0, x4
        sub     x2, x2, x0
        add     x1, x1, x5
        ldr     d2, [x3, x2]
        add     x0, x0, 4
        ldr     w1, [x1, 4]
        dup     v0.2s, w1
        mla     v1.2s, v0.2s, v2.2s
        cmp     x0, 120
        bne     .L3

But I think the main reason for the performance regression is:
        ldr     w1, [x1, 4]
        dup     v0.2s, w1

If the compiler had used ld1r instead, the performance would be back to normal.
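
To make the load-plus-dup point concrete, here is a rough, portable C model of what that instruction pair and the following mla compute per iteration. It is only a sketch: the v2s type and the helper names load_and_replicate and mla are made up for illustration and are not GCC internals or NEON intrinsics. On AArch64 the scalar load followed by a broadcast is the work that a single load-and-replicate instruction (LD1R) can do in one step.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit vector register holding two
   32-bit lanes (the "v.2s" arrangement in the assembly). */
typedef struct { int32_t lane[2]; } v2s;

/* Models "ldr w1, [addr]" followed by "dup v0.2s, w1":
   load one scalar, then copy it into every lane.
   A load-and-replicate instruction would do both at once. */
static v2s load_and_replicate(const int32_t *addr)
{
    int32_t w = *addr;          /* the scalar load */
    v2s v = { { w, w } };       /* the broadcast   */
    return v;
}

/* Models "mla v1.2s, v0.2s, v2.2s": acc += a * b, lane by lane. */
static v2s mla(v2s acc, v2s a, v2s b)
{
    for (int i = 0; i < 2; i++)
        acc.lane[i] += a.lane[i] * b.lane[i];
    return acc;
}
```

The model says nothing about timing; it only shows that the two-instruction sequence and a fused load-and-replicate produce the same vector operand for the mla.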