https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834
--- Comment #7 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Thanks for looking at this.

(In reply to kugan from comment #6)
>         cmp     w3, 0
>         ble     .L1
>         sub     w3, w3, #1
>         mov     x4, 0
>         cntw    x5
>         ptrue   p1.s, all
>         lsr     w3, w3, 1
>         add     w3, w3, 1
>         whilelo p0.s, xzr, x3
>         .p2align 3,,7
> .L3:
>         ld2w    {z4.s - z5.s}, p0/z, [x1, x4, lsl 2]
>         ld2w    {z2.s - z3.s}, p0/z, [x2, x4, lsl 2]
>         add     z0.s, z4.s, z2.s
>         sub     z1.s, z5.s, z3.s
>         st2w    {z0.s - z1.s}, p0, [x0, x4, lsl 2]
>         whilelo p0.s, x5, x3
>         incb    x4, all, mul #2
>         incw    x5
>         ptest   p1, p0.b
>         bne     .L3
> .L1:
>         ret
>         .cfi_endproc

This doesn't look right.  x4 is an index, so it should be incremented by the
number of words in two vectors, rather than the number of bytes in two vectors.
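
To make the arithmetic concrete, here is a rough C model of the two increments.
It is only a sketch: the helper names are made up for illustration, and the
vector length is passed in as an assumed parameter (vl_bytes) rather than read
from hardware.  The loads and stores address memory as [base, x4, lsl 2], so
x4 counts 32-bit words and is multiplied by 4 to get a byte offset.

```c
#include <stddef.h>

/* Hypothetical model of "incb x4, all, mul #2": add the number of
   bytes in two vectors.  vl_bytes is an assumed vector length.  */
static size_t
model_incb_mul2 (size_t x4, size_t vl_bytes)
{
  return x4 + 2 * vl_bytes;
}

/* Hypothetical model of "incw x4, all, mul #2": add the number of
   32-bit words in two vectors.  */
static size_t
model_incw_mul2 (size_t x4, size_t vl_bytes)
{
  return x4 + 2 * (vl_bytes / 4);
}

/* Byte offset used by an address of the form [base, x4, lsl 2].  */
static size_t
model_byte_offset (size_t x4)
{
  return x4 << 2;
}
```

Assuming 256-bit vectors (vl_bytes == 32), one ld2w/st2w iteration covers
64 bytes.  With the word increment, x4 goes from 0 to 16 and the next access
starts at byte offset 64, as expected.  With the byte increment, x4 goes from
0 to 64 and the next access starts at byte offset 256, skipping most of the
data.  So the loop would presumably want something like
"incw x4, all, mul #2" instead.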