https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833
--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
On most CPUs, psrldq / movd is optimal for xmm[1] -> int without SSE4.  On
SnB-family, movd runs on port0, and psrldq can run on port5, so they can
execute in parallel.  (And the second movd can run the next cycle.)

I'd suggest using movd/psrldq/movd for -mtune=generic.  (Or pshuflw to
copy+shuffle if it's useful to avoid destroying the value in the xmm reg
while extracting to integer.  pshuflw is faster than pshufd on old CPUs, and
the same speed on current CPUs.)

But for some CPUs, this is better:

    movd    %xmm0, %eax
    psrlq   $32, %xmm0
    movd    %xmm0, %edx

A 64-bit shift by 32 is much better than PSRLDQ on some CPUs, especially
SlowShuffle CPUs (where xmm pshufd is slower than 64-bit granularity
shuffles):

* P4: 2c latency instead of 4, and twice the throughput.
* Pentium M: 2 uops instead of 4.
* Core 2 Merom/Conroe: 1 uop instead of 2.
* K8/K10: same as PSRLDQ.
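For reference, a minimal C sketch of the two extraction sequences, using only
SSE2 intrinsics.  This is not from the report itself; the function names are
my own illustration, and the compiler may of course pick a different shuffle
depending on -mtune, which is exactly what this bug is about.

    #include <emmintrin.h>  /* SSE2 */

    /* Extract elements 0 and 1 via the 64-bit shift.  At -O2 this typically
       compiles to: movd %xmm0,%eax / psrlq $32,%xmm0 / movd %xmm0,%edx */
    void extract_lo_pair_psrlq(__m128i v, int *e0, int *e1)
    {
        *e0 = _mm_cvtsi128_si32(v);                     /* movd xmm -> r32 */
        *e1 = _mm_cvtsi128_si32(_mm_srli_epi64(v, 32)); /* psrlq $32; movd */
    }

    /* Extract element 1 via the byte shift: typically psrldq $4 + movd. */
    int extract_e1_psrldq(__m128i v)
    {
        return _mm_cvtsi128_si32(_mm_srli_si128(v, 4)); /* psrldq $4; movd */
    }

Compiling with e.g. gcc -O2 -msse2 and varying -mtune makes it easy to
compare the shuffle choices across the CPU families listed above.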