https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570
--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Andrew Pinski from comment #2)
> Even on aarch64:
> 
> .L2:
>         ldr     q0, [x1], 16
>         sxtl    v1.2d, v0.2s
>         sxtl2   v0.2d, v0.4s
>         scvtf   v1.2d, v1.2d
>         scvtf   v0.2d, v0.2d
>         stp     q1, q0, [x0]
> 
> But the above is decent really.

More than decent: that's what we *should* be doing, I think.  AArch64 has versions of most widening instructions that read the top half of a vector, unlike x86-64 where VPMOVZX / SX can only read from the bottom half.  That's the key difference, and what makes this strategy good on ARM but bad on x86-64.

(On 32-bit ARM, you load a q register, then read the two halves separately as 64-bit d<0..31> registers.  AArch64 changed that: there are 32x 128-bit vector regs and no partial regs aliasing the high half, but it provides OP / OP2 versions of instructions that widen (and the like), with the "2" version reading the high half.  Presumably part of the motivation was to make it easier to port ARM NEON code that depended on accessing the halves of a 128-bit q vector through its d regs.  But it's a generally reasonable design, and could also be motivated by seeing how inconvenient things get in SSE and AVX with pmovsx/zx.)

Anyway, AArch64 SIMD is specifically designed so that doing a wide load and then unpacking both halves is fully efficient, as is possible on 32-bit ARM but not on x86-64.

It's also using a store (of a pair of regs) that's twice the width of the load.  But even if it were using a max-width load of a pair of 128-bit vectors (and having to store two pairs), that would still be good: just effectively unrolling.  GCC sees it as one load and two separate stores that it just happens to be able to combine into a pair.
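
For reference, a minimal intrinsics sketch of the load-then-unpack-both-halves pattern that the quoted asm is doing, assuming the testcase is an int32 -> double conversion loop (the function name, signature, and the n % 4 == 0 assumption are just illustrative, not the actual testcase):

#include <arm_neon.h>
#include <stdint.h>

/* Hypothetical illustration of the pattern in the asm above: one 128-bit
 * load, sign-extend both halves (SXTL / SXTL2), convert (SCVTF), and store
 * both results (the compiler can merge the two stores into one STP).
 * Assumes n is a multiple of 4. */
void convert_i32_to_f64(double *restrict dst, const int32_t *restrict src,
                        long n)
{
    for (long i = 0; i < n; i += 4) {
        int32x4_t   v   = vld1q_s32(src + i);           /* ldr   q0, [x1], 16 */
        int64x2_t   lo  = vmovl_s32(vget_low_s32(v));   /* sxtl  v1.2d, v0.2s */
        int64x2_t   hi  = vmovl_high_s32(v);            /* sxtl2 v0.2d, v0.4s */
        float64x2_t dlo = vcvtq_f64_s64(lo);            /* scvtf v1.2d, v1.2d */
        float64x2_t dhi = vcvtq_f64_s64(hi);            /* scvtf v0.2d, v0.2d */
        vst1q_f64(dst + i,     dlo);                    /* stp q1, q0, [x0]   */
        vst1q_f64(dst + i + 2, dhi);
    }
}

A plain scalar loop (dst[i] = src[i];) auto-vectorized at -O3 is presumably what produced the quoted asm in the first place; the intrinsics just spell out the per-instruction structure.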