https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116145

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
We've looked at this type of constant initialization in the past, and even
though the LLVM version looks faster, it isn't in practice.

If you look at the Software Optimization Guides, SVE cores no longer handle
MOV/MOVK pairs specially.
So here the sequence is a 3-instruction dependency chain with very low
throughput:

        mov     w8, #23615
        movk    w8, #2573, lsl #16
        mov     z0.s, w8
        ret

vs

        ptrue   p3.b, all
        adrp    x0, .LC0
        add     x0, x0, :lo12:.LC0
        ld1rw   z0.s, p3/z, [x0]
        ret

which is also a 3-instruction dependency chain, but loads have higher
throughput than register transfers and the latency difference is hidden.
In most real code you'd also have shared the anchor and ptrue or, if in a loop,
the ptrue and the adrp would have been hoisted out.
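
For reference, here is a minimal C sketch of what the two listings compute.  It
is only an illustration: the helper names are hypothetical, and the SVE function
is an assumed shape for the reproducer, not the testcase from the report.  The
first helper models how the MOV/MOVK pair assembles the 32-bit value from the
two 16-bit immediates; the second, guarded so the file still builds without SVE,
splats that value with the ACLE intrinsic svdup_n_u32.

```c
#include <assert.h>
#include <stdint.h>

/* Models the scalar MOV + MOVK (lsl #16) pair: MOV writes the low 16 bits,
   MOVK replaces bits 16..31 and keeps the rest.  Hypothetical helper name. */
static inline uint32_t mov_movk_w(uint16_t lo, uint16_t hi)
{
    uint32_t w = lo;                               /* mov  w8, #lo          */
    w = (w & 0x0000FFFFu) | ((uint32_t)hi << 16);  /* movk w8, #hi, lsl #16 */
    return w;
}

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
/* Assumed reproducer shape: splat the 32-bit constant across an SVE vector. */
svuint32_t splat_constant(void)
{
    return svdup_n_u32(0x0A0D5C3Fu);
}
#endif

/* Sanity check that the immediates from the listing form this constant. */
static inline void check_constant(void)
{
    assert(mov_movk_w(23615, 2573) == 0x0A0D5C3Fu);
}
```

The immediates #23615 (0x5C3F) and #2573 (0x0A0D) combine to 0x0A0D5C3F, which
is the word the ld1rw variant would load from .LC0 instead.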

Benchmarking has shown that there's no real performance difference between
these two when there is only 1 constant.  When there is more than one constant,
the load variant wins by a large margin, as the SVE mov serializes the
construction of all constants.

The concern here is that, because of this serialization, constant
rematerialization inside loops would become slower.
So I don't believe the LLVM sequence is beneficial to implement.

That said, when we looked at this we did come to the conclusion that we can use
SVE's ORR and other immediate instructions to construct more immediates on the
SIMD side itself.  That way we avoid the transfer.
