https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42612

Dmitry Baksheev <bd at mail dot ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bd at mail dot ru

--- Comment #6 from Dmitry Baksheev <bd at mail dot ru> ---
Please consider fixing this issue. Here is another example where GCC's failure
to use post-increment addressing in a loop produces suboptimal code on AArch64.
For this dot-product function, the GCC-generated code is 4x slower than the
LLVM-generated code:

    #include <cstddef>

    double dotprod(std::size_t n,
         const double* __restrict__ a,
         const double* __restrict__ b)
    {
        double ans = 0;
        #if __clang__
        #pragma clang loop vectorize(assume_safety)
        #else
        #pragma GCC ivdep
        #endif
        for (std::size_t i = 0; i < n; ++i) {
            ans += a[i] * b[i];
        }
        return ans;
    }


Compile with: $(CXX) -march=armv8.2-a -O3 dp.cpp

The GCC-generated loop does not use post-increment loads:
    .L3:
        ldr d2, [x1, x3, lsl 3]
        ldr d1, [x2, x3, lsl 3]
        add x3, x3, 1
        fmadd   d0, d2, d1, d0
        cmp x0, x3
        bne .L3

Clang emits this:
    .LBB0_4:
        ldr d1, [x10], #8
        ldr d2, [x8], #8
        subs    x9, x9, #1
        fmadd   d0, d1, d2, d0
        b.ne    .LBB0_4
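
The Clang loop corresponds to walking both arrays by pointer and counting the
trip count down to zero. A minimal source-level sketch of that shape follows.
The name dotprod_postinc is hypothetical and used only for illustration. It is
not verified here that GCC emits post-indexed loads for this variant, since
induction-variable optimization may rewrite the pointers back into
base-plus-index form.

```cpp
#include <cstddef>

// Hypothetical variant of dotprod (not part of the original report).
// Each array is read through a pointer that is advanced after the load,
// which is the natural source-level shape of a post-indexed access
// such as "ldr d1, [x10], #8". The counter runs down to zero so the
// loop exit test can be folded into the decrement.
double dotprod_postinc(std::size_t n,
                       const double* __restrict__ a,
                       const double* __restrict__ b)
{
    double ans = 0;
    for (; n != 0; --n) {
        ans += *a++ * *b++;  // load both elements, then bump both pointers
    }
    return ans;
}
```

Comparing the assembly for this variant against the indexed version, with the
same flags as above, would show whether the addressing-mode choice depends on
how the loop is written or is decided later in the optimizer.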
