https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116145
Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
We've looked at this type of constant initialization in the past, and even
though the LLVM version looks faster, it isn't in practice. According to the
Software Optimization Guides, SVE cores no longer handle MOV/MOVK pairs
specially. So here the sequence is a 3-instruction dependency chain with very
low throughput:

        mov     w8, #23615
        movk    w8, #2573, lsl #16
        mov     z0.s, w8
        ret

vs

        ptrue   p3.b, all
        adrp    x0, .LC0
        add     x0, x0, :lo12:.LC0
        ld1rw   z0.s, p3/z, [x0]
        ret

which is also a 3-instruction dependency chain, but loads have higher
throughput than register transfers and the latency difference is hidden. In
most real code you'd also have shared the anchor and the ptrue, or, if in a
loop, the ptrue and the adrp would have been floated out.

Benchmarking has shown that there's no real performance difference between
these two when there is a single constant. When there is more than one
constant, the load variant wins by a large margin, as the SVE mov serializes
the construction of all constants. The concern is that, because of this
serialization, constant rematerialization inside loops would become slower.

So I don't believe the LLVM sequence is beneficial to implement.

That said, when we looked at this we did come to the conclusion that we can
use SVE's ORR and other immediate instructions to construct more immediate
sequences on the SIMD side itself. That way we avoid the transfer.
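For readers who want to check which constant the MOV/MOVK pair above builds, here is a minimal Python sketch of the arithmetic. It models only the register values, not latency or throughput. The helper names `mov_movk_w` and `broadcast_s` are hypothetical, and the 4-lane vector assumes a 128-bit SVE vector length, which is only one of the lengths the architecture permits.

```python
def mov_movk_w(lo16, hi16):
    """Model `mov wN, #lo16` followed by `movk wN, #hi16, lsl #16`."""
    if not (0 <= lo16 <= 0xFFFF and 0 <= hi16 <= 0xFFFF):
        raise ValueError("each immediate must fit in 16 bits")
    value = lo16                            # mov: low half set, upper bits zero
    value = (value & 0xFFFF) | (hi16 << 16)  # movk: replace bits 31:16, keep 15:0
    return value & 0xFFFFFFFF


def broadcast_s(value, lanes):
    """Model `mov z0.s, wN`: copy the 32-bit value into every .s lane."""
    return [value] * lanes


# Immediates taken from the sequence quoted in the comment.
const = mov_movk_w(23615, 2573)
print(hex(const))             # 0xa0d5c3f

# Assumption: 128-bit vector length, i.e. four 32-bit lanes.
z0 = broadcast_s(const, 4)
print([hex(lane) for lane in z0])
```

The load variant yields the same vector contents, provided the word at `.LC0` holds the same 32-bit value. The difference discussed in the comment is only in how the value reaches the vector register: through a general-purpose register transfer, or through a broadcasting load.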