https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116684
ktkachov at gcc dot gnu.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-09-11
                 CC|                            |ktkachov at gcc dot gnu.org,
                   |                            |tnfchris at gcc dot gnu.org
--- Comment #1 from ktkachov at gcc dot gnu.org ---
Indeed. Curiously, for aarch64 at -O2 GCC is smart enough to recognise the
USDOT pattern and emit the instruction, but at -O3 (-mcpu=neoverse-v2) the
whole computation gets synthesised from MULs and widening adds instead.
-O2:
.L2:
        ldr     s29, [x2, x3]
        ld1b    z27.b, p7/z, [x1, x3]
        sel     z27.b, p7, z27.b, z30.b
        fmov    s28, s29
        movprfx z29, z31
        insr    z29.s, s28
        ld1b    z28.b, p7/z, [x0]
        usdot   z29.s, z28.b, z27.b
        uaddv   d29, p6, z29.s
        str     s29, [x2, x3]
        add     x3, x3, 4
        cmp     x3, 64
        bne     .L2
-O3:
        ld4     {v24.16b - v27.16b}, [x1]
        ldrb    w3, [x0]
        ldrb    w1, [x0, 1]
        ldp     q29, q28, [x2]
        dup     v4.4h, w3
        ldp     q31, q30, [x2, 32]
        dup     v5.4h, w1
        ldrb    w1, [x0, 2]
        sxtl    v16.8h, v24.8b
        sxtl2   v24.8h, v24.16b
        ldrb    w0, [x0, 3]
        sxtl    v17.8h, v25.8b
        sxtl2   v25.8h, v25.16b
        sxtl    v18.8h, v26.8b
        dup     v6.4h, w1
        sxtl2   v26.8h, v26.16b
        sxtl    v19.8h, v27.8b
        mul     v24.8h, v24.8h, v4.h[0]
        dup     v7.4h, w0
        mul     v20.8h, v16.8h, v4.h[0]
        sxtl2   v27.8h, v27.16b
        mul     v21.8h, v17.8h, v5.h[0]
        mul     v25.8h, v25.8h, v5.h[0]
        saddw   v31.4s, v31.4s, v24.4h
        mul     v23.8h, v18.8h, v6.h[0]
        saddw2  v30.4s, v30.4s, v24.8h
        saddw   v29.4s, v29.4s, v20.4h
        mul     v26.8h, v26.8h, v6.h[0]
        saddw2  v28.4s, v28.4s, v20.8h
        mul     v24.8h, v19.8h, v7.h[0]
        saddw   v29.4s, v29.4s, v21.4h
        saddw2  v28.4s, v28.4s, v21.8h
        saddw   v31.4s, v31.4s, v25.4h
        mul     v27.8h, v27.8h, v7.h[0]
        saddw2  v30.4s, v30.4s, v25.8h
        saddw   v29.4s, v29.4s, v23.4h
        saddw2  v28.4s, v28.4s, v23.8h
        saddw   v31.4s, v31.4s, v26.4h
        saddw2  v30.4s, v30.4s, v26.8h
        saddw   v29.4s, v29.4s, v24.4h
        saddw2  v28.4s, v28.4s, v24.8h
        saddw   v31.4s, v31.4s, v27.4h
        saddw2  v30.4s, v30.4s, v27.8h
        stp     q29, q28, [x2]
        stp     q31, q30, [x2, 32]
The -O3 version does fully unroll the loop, so it's probably better, but maybe
it could do a better job of using USDOT?
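
For reference, both listings appear to compute the same thing. The sketch
below is a reconstruction inferred from the assembly only; the reporter's test
case is not quoted in this comment, so the function name, parameter names and
array shapes are assumptions. It models 16 int32 accumulators (x2), each
receiving the dot product of the four unsigned bytes at x0 with four
consecutive signed bytes at x1, which is what one 32-bit lane of USDOT does.

```c
#include <stdint.h>

/* Hypothetical reconstruction of the kernel, inferred from the listings
   (not the reporter's test case).  out ~ x2, a ~ x0, b ~ x1.  */
void usdot_kernel(int32_t out[16], const uint8_t a[4], const int8_t b[64])
{
    for (int j = 0; j < 16; j++) {
        int32_t sum = 0;
        /* One USDOT lane: four unsigned-by-signed byte products.  */
        for (int k = 0; k < 4; k++)
            sum += (int32_t)a[k] * (int32_t)b[4 * j + k];
        out[j] += sum;
    }
}
```

Each u8 * s8 product lies in [-32640, 32385] and so fits in 16 bits, which is
why the -O3 sequence can use 16-bit MULs and only widen to 32 bits in the
SADDW/SADDW2 accumulation.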