https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010
--- Comment #10 from Manolis Tsamis <manolis.tsamis at vrull dot eu> ---
(In reply to ptomsich from comment #9)
> (In reply to Manolis Tsamis from comment #0)
> > E.g. another loop, non canonicalized names:
> >
> > .L120:
> >         ldr     q30, [x0], 16
> >         movi    v29.2s, 0
> >         ld2     {v26.16b - v27.16b}, [x4], 32
> >         movi    v25.4s, 0
> >         zip1    v29.16b, v30.16b, v29.16b
> >         zip2    v30.16b, v30.16b, v25.16b
> >         umlal   v29.8h, v26.8b, v28.8b
> >         umlal2  v30.8h, v26.16b, v28.16b
> >         uaddw   v31.4s, v31.4s, v29.4h
> >         uaddw2  v31.4s, v31.4s, v29.8h
> >         uaddw   v31.4s, v31.4s, v30.4h
> >         uaddw2  v31.4s, v31.4s, v30.8h
> >         cmp     x5, x0
> >         bne     .L120
>
> Is it just me, or are the zip1 and zip2 instructions dead?
>
> Philipp.

They certainly look dead, but they're not because the umlal/umlal2 (and other
accumulate instructions) also read from the destination register.

There looks to be a missed optimization opportunity to use just a single
`movi v25.4s, 0` here though.

Also, looking again at the generated code in the first example:

        mov     v23.16b, v18.16b
        mla     v23.16b, v17.16b, v25.16b

If I'm correct this could be folded into just

        mla     v18.16b, v17.16b, v25.16b

In which case most of the movs in the first and second example could be
eliminated.

To me it looks like the accumulate instructions are missing some optimizations.
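
For anyone else reading the loop: below is a rough scalar sketch in C of why the zip1 result is live. It is only an illustration, not GCC code; the function names (`zip1_with_zero_model`, `umlal_model`, `demo_lane0`) and the sample values are made up for this comment, and it assumes the usual little-endian lane layout. zip1 with a zero register zero-extends the low eight loaded bytes to halfwords, and umlal then adds the widened products on top of that value, so the destination is an input as well as an output.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Hypothetical model of "zip1 v29.16b, v30.16b, <zero>": interleaving the
   low 8 bytes of the source with zero bytes and viewing the result as
   8 halfwords is a zero-extension of those bytes. */
static void zip1_with_zero_model(uint16_t out[LANES], const uint8_t src[16])
{
    for (int i = 0; i < LANES; i++)
        out[i] = src[i];
}

/* Hypothetical model of "umlal v29.8h, v26.8b, v28.8b":
   acc[i] += widen(a[i]) * widen(b[i]).
   Note that acc is read before it is written. */
static void umlal_model(uint16_t acc[LANES], const uint8_t a[LANES],
                        const uint8_t b[LANES])
{
    for (int i = 0; i < LANES; i++)
        acc[i] = (uint16_t)(acc[i] + (uint16_t)a[i] * (uint16_t)b[i]);
}

/* One zip1 + umlal step on made-up sample data; returns lane 0. */
static uint16_t demo_lane0(void)
{
    const uint8_t loaded[16] = {100, 1, 2, 3, 4, 5, 6, 7,
                                8, 9, 10, 11, 12, 13, 14, 15};
    const uint8_t a[LANES] = {3, 0, 0, 0, 0, 0, 0, 0};
    const uint8_t b[LANES] = {4, 0, 0, 0, 0, 0, 0, 0};
    uint16_t acc[LANES];

    zip1_with_zero_model(acc, loaded);
    assert(acc[0] == loaded[0]);   /* value produced by the zip1 step */
    umlal_model(acc, a, b);        /* adds 3 * 4 on top of it */
    return acc[0];
}
```

If the zip1 step were dropped, lane 0 would lose the loaded value and hold only the product, which is why the instruction cannot be treated as dead.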