https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833
--- Comment #14 from Peter Cordes <peter at cordes dot ca> ---
I happened to look at this old bug again recently.

Re: extracting the low two 32-bit elements:

(In reply to Uroš Bizjak from comment #11)
> > Or without SSE4, -mtune=sandybridge (anything that excludes Nehalem and
> > other CPUs where an FP shuffle has a bypass delay between integer ops):
> >
> >     movd      %xmm0, %eax
> >     movshdup  %xmm0, %xmm0   # saves 1B of code-size vs. psrldq, I think.
> >     movd      %xmm0, %edx
> >
> > Or without SSE3:
> >
> >     movd      %xmm0, %eax
> >     psrldq    $4, %xmm0      # 1 m-op cheaper than pshufd on K8
> >     movd      %xmm0, %edx
>
> The above two proposals are not suitable for generic moves.  We should
> not clobber the input value, and we are not allowed to use a temporary.

SSE3 movshdup broadcasts the high element within each pair of 32-bit
elements, so

    movshdup  %xmm0, %xmm1
    movd      %xmm1, %eax

saves a byte of code size vs. pshufd / movd, and saves a uop on Merom and
avoids a flt->int bypass delay.  (According to Agner Fog's tables, pshufd
is flt->int domain, i.e. it wants its input in the float domain, while
movshdup ironically is only an integer shuffle.)  Both non-clobbering
sequences are spelled out at the end of this comment.  Probably not worth
looking for that optimization, though, because it's not worth using
universally (Nehalem has worse latency for float shuffles between int
instructions).

With just SSE2, PSHUFLW is the same size as PSHUFD and faster on Merom /
K8 (slow-shuffle CPUs where PSHUFD is multiple uops); see the sketch
below.  It's not slower on any current CPU.  I could imagine some future
CPU having better throughput for shuffles with 32-bit elements than with
16-bit elements, though.  That's already the case for wider lane-crossing
shuffles (VPERMW ymm is multiple uops on Skylake-AVX512).

This would be a definite win for -mtune=core2 or -mtune=k8, and Pentium M,
but those are so old it's probably not worth adding extra code to look for
it.  I think it's pretty future-proof, though, unless Intel or AMD add an
extra shuffle unit for element sizes of 32 bits or wider on another port.
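
For concreteness, a minimal sketch of the non-clobbering SSE3 sequence vs.
the SSE2 pshufd version, assuming the input arrives in %xmm0 (and must be
preserved) and %xmm1 is available as a destination:

    # SSE3: movshdup takes no imm8, so it's 4 bytes (F3 0F 16 /r)
    movd      %xmm0, %eax            # element 0 -> eax
    movshdup  %xmm0, %xmm1           # element 1 -> low dword of xmm1
    movd      %xmm1, %edx            # element 1 -> edx

    # SSE2: pshufd needs an imm8, so it's 5 bytes (66 0F 70 /r ib)
    movd      %xmm0, %eax
    pshufd    $0xe5, %xmm0, %xmm1    # elements {1,1,2,3}: element 1 -> low dword
    movd      %xmm1, %edx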
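
And the SSE2-only PSHUFLW variant, the same 5-byte size as PSHUFD (F2 0F 70
/r ib) but fewer uops on slow-shuffle CPUs.  Imm8 0xee copies words 2 and 3
down to words 0 and 1, putting element 1 in the low dword:

    movd      %xmm0, %eax            # element 0 -> eax
    pshuflw   $0xee, %xmm0, %xmm1    # low words {2,3,2,3}: element 1 -> low dword
    movd      %xmm1, %edx            # element 1 -> edx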