https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010

--- Comment #10 from Manolis Tsamis <manolis.tsamis at vrull dot eu> ---
(In reply to ptomsich from comment #9)
> (In reply to Manolis Tsamis from comment #0) 
> > E.g. another loop, non canonicalized names:
> > 
> > .L120:
> >     ldr     q30, [x0], 16
> >     movi    v29.2s, 0
> >     ld2     {v26.16b - v27.16b}, [x4], 32
> >     movi    v25.4s, 0
> >     zip1    v29.16b, v30.16b, v29.16b
> >     zip2    v30.16b, v30.16b, v25.16b
> >     umlal   v29.8h, v26.8b, v28.8b
> >     umlal2  v30.8h, v26.16b, v28.16b
> >     uaddw   v31.4s, v31.4s, v29.4h
> >     uaddw2  v31.4s, v31.4s, v29.8h
> >     uaddw   v31.4s, v31.4s, v30.4h
> >     uaddw2  v31.4s, v31.4s, v30.8h
> >     cmp     x5, x0
> >     bne     .L120
> 
> Is it just me, or are the zip1 and zip2 instructions dead?
> 
> Philipp.

They certainly look dead, but they're not: umlal/umlal2 (like the other
accumulate instructions) also read from their destination register, so the
zip1/zip2 results in v29/v30 are the accumulator inputs.
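
To make that concrete, here is a rough lane-level model of the zip1/umlal pair
from the loop above. It is only a sketch in Python, not how GCC or the hardware
represents anything; the helper names (zip1_16b, as_halfwords, umlal_8h) and the
input values are made up for illustration, and little-endian lane order is
assumed.

```python
# Hypothetical lane-level model; registers are lists of unsigned lane values.

def zip1_16b(vn, vm):
    """ZIP1 Vd.16b, Vn.16b, Vm.16b: interleave the low 8 bytes of vn and vm."""
    out = []
    for i in range(8):
        out += [vn[i], vm[i]]
    return out

def as_halfwords(b):
    """View 16 bytes as 8 little-endian 16-bit lanes."""
    return [b[2 * i] | (b[2 * i + 1] << 8) for i in range(8)]

def umlal_8h(acc, vn, vm):
    """UMLAL Vd.8h, Vn.8b, Vm.8b: acc[i] += vn[i] * vm[i], modulo 2**16.

    Note that acc (the destination register) is an input as well as an output.
    """
    return [(a + vn[i] * vm[i]) & 0xFFFF for i, a in enumerate(acc)]

# Illustrative inputs only.
q30 = list(range(1, 17))   # bytes loaded by the ldr
zero = [0] * 16            # result of movi ..., 0
v26 = [2] * 16
v28 = [3] * 16

# zip1 with a zero vector zero-extends the low 8 bytes of q30 to halfwords.
v29 = as_halfwords(zip1_16b(q30, zero))

with_zip = umlal_8h(v29, v26, v28)         # accumulator seeded by zip1
without_zip = umlal_8h([0] * 8, v26, v28)  # what you'd get if zip1 were dropped

print(with_zip)
print(without_zip)
```

The two results differ, which is why the zip1 (and likewise zip2 for the upper
half) cannot simply be deleted.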

There does look to be a missed optimization opportunity here, though: a single
`movi v25.4s, 0` would be enough instead of two zeroing instructions.

Also, looking again at the generated code in the first example:

        mov     v23.16b, v18.16b
        mla     v23.16b, v17.16b, v25.16b

If I'm correct, and v18 is not needed afterwards, this could be folded into just

        mla     v18.16b, v17.16b, v25.16b

If so, most of the movs in the first and second examples could be eliminated.
To me it looks like the accumulate instructions are missing some
optimizations.
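
As a sanity check on the mov/mla folding, here is a small sketch that models
both sequences on a toy register file. Again this is Python used purely as a
lane-level model; the helper names (mov_16b, mla_16b, initial_regs) and the
register contents are invented for illustration.

```python
# Hypothetical model: a register file is a dict of 16-lane byte vectors.

def mov_16b(regs, d, s):
    """MOV Vd.16b, Vs.16b: copy all 16 byte lanes."""
    regs[d] = list(regs[s])

def mla_16b(regs, d, n, m):
    """MLA Vd.16b, Vn.16b, Vm.16b: Vd[i] += Vn[i] * Vm[i], modulo 2**8."""
    regs[d] = [(regs[d][i] + regs[n][i] * regs[m][i]) & 0xFF
               for i in range(16)]

def initial_regs():
    # Illustrative values only.
    return {
        "v17": list(range(16)),
        "v18": [10] * 16,
        "v23": [0xAA] * 16,
        "v25": [20] * 16,
    }

# Sequence as generated: copy the accumulator, then accumulate into the copy.
a = initial_regs()
mov_16b(a, "v23", "v18")
mla_16b(a, "v23", "v17", "v25")

# Folded sequence: accumulate directly into v18.
b = initial_regs()
mla_16b(b, "v18", "v17", "v25")

print(a["v23"] == b["v18"])  # same accumulated value, in a different register
print(a["v18"] == b["v18"])  # the folded form overwrites the old v18
```

The accumulated value is the same in both cases, but the folded form clobbers
v18, so it only applies when the old value of v18 is dead after this point and
later uses of v23 can be redirected to v18.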
