https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42612
Dmitry Baksheev <bd at mail dot ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bd at mail dot ru

--- Comment #6 from Dmitry Baksheev <bd at mail dot ru> ---
Please consider fixing this issue. Here is another example where not using
post-increment for loops produces suboptimal code on AArch64. The code is 4x
slower than LLVM-generated code for dot-product function:

    double dotprod(std::size_t n,
                   const double* __restrict__ a,
                   const double* __restrict__ b)
    {
        double ans = 0;
    #if __clang__
    #pragma clang loop vectorize(assume_safety)
    #else
    #pragma GCC ivdep
    #endif
        for (std::size_t i = 0; i < n; ++i) {
            ans += a[i] * b[i];
        }
        return ans;
    }

Compile with:

    $(CXX) -march=armv8.2-a -O3 dp.cpp

GCC-generated loop does not have post-increment loads:

    .L3:
            ldr     d2, [x1, x3, lsl 3]
            ldr     d1, [x2, x3, lsl 3]
            add     x3, x3, 1
            fmadd   d0, d2, d1, d0
            cmp     x0, x3
            bne     .L3

Clang emits this:

    .LBB0_4:
            ldr     d1, [x10], #8
            ldr     d2, [x8], #8
            subs    x9, x9, #1
            fmadd   d0, d1, d2, d0
            b.ne    .LBB0_4
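
For comparison, the same dot product can be written at the source level in the shape the Clang loop has: two pointers that advance after each load, plus a count that runs down to zero. This is only a sketch for experimenting with the report. The name dotprod_ptr is hypothetical and not part of the original test case, and nothing here implies that GCC will select post-increment loads for it. The optimizer is free to rewrite the induction variables either way.

```cpp
#include <cstddef>

// Hypothetical pointer-walking variant of dotprod from the comment above.
// Each iteration loads through a and b, then advances both pointers,
// which mirrors "ldr d1, [x10], #8" / "ldr d2, [x8], #8" in the Clang output.
// The down-counting n mirrors "subs x9, x9, #1".
double dotprod_ptr(std::size_t n,
                   const double* __restrict__ a,
                   const double* __restrict__ b)
{
    double ans = 0;
    for (; n != 0; --n) {
        ans += *a++ * *b++;
    }
    return ans;
}
```

Building this with the same command line as above and comparing the inner loop against the indexed version shows whether the source form changes the addressing mode that is chosen.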