https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Even -fno-ivopts produces better code:
.L3:
    add    x0, x5, x1, sxtw
    add    w1, w1, 1
    ldr    d2, [x3], 124
    ldr    w0, [x4, x0, lsl 2]
    dup    v0.2s, w0
    mla    v1.2s, v0.2s, v2.2s
    subs    w2, w2, #1
    bne    .L3

Compared with:
.L3:
    lsl    x2, x0, 5
    add    x1, x0, x4
    sub    x2, x2, x0
    add    x1, x1, x5
    ldr    d2, [x3, x2]
    add    x0, x0, 4
    ldr    w1, [x1, 4]
    dup    v0.2s, w1
    mla    v1.2s, v0.2s, v2.2s
    cmp    x0, 120
    bne    .L3

But I think the main reason for the performance regression is:
    ldr    w1, [x1, 4]
    dup    v0.2s, w1

If the compiler had used ld1r (load one element and replicate it to all lanes)
instead, the performance would be back to normal.
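
For illustration only, here is a hand-written sketch (not actual compiler
output) of how the load/dup pair in the second loop might look with ld1r.
It assumes x1 is free to be modified, which seems to hold since x1 is
recomputed at the top of each iteration. ld1r has no immediate-offset
addressing form, so the +4 needs a separate add:
    add    x1, x1, 4            // fold the offset into the base register
    ld1r   {v0.2s}, [x1]        // load one 32-bit element, replicate to both lanes

This is the same instruction count as the ldr/dup pair, but the value is
loaded straight into the vector register instead of going through w1 first.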
