https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833
--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
I don't think it's worth anyone's time to implement this in 2017, but using MMX regs for 64-bit store/load would be faster on really old CPUs that split 128-bit vector insns into two halves, like K8 and Pentium M. Especially with -mno-sse2 (e.g. Pentium III compat), where movlps has a false dependency on the old value of the xmm reg, but movq mm0 doesn't. (Without SSE2 we can't MOVQ or MOVSD to an XMM reg.)

MMX is also a saving in code-size: one fewer prefix byte vs. SSE2 integer instructions. And in 32-bit mode it's another set of 8 registers.

But Skylake has lower throughput for the MMX versions of some instructions than for the XMM versions. And SSE4 instructions like PEXTRD don't have MMX versions, unlike SSSE3 and earlier (e.g. pshufb mm0, mm1 is available, and on Conroe it's faster than the xmm version).
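For reference, here's a minimal sketch of the kind of 64-bit copy being discussed, written with GCC's MMX intrinsics (the function name and the use of intrinsics rather than compiler-generated code are my illustration, not part of the PR):

```c
#include <mmintrin.h>  /* MMX intrinsics */

/* Copy 8 bytes through an MMX register: movq mm0, [src] / movq [dst], mm0.
 * Unlike a movlps load into an xmm reg (the -mno-sse2 fallback), the movq
 * load writes the whole mm reg, so there's no false dependency on its old
 * value.  __m64 is declared may_alias in GCC, so the casts are safe. */
void copy64_mmx(void *dst, const void *src)
{
    __m64 v = *(const __m64 *)src;
    *(__m64 *)dst = v;
    _mm_empty();  /* emms: leave MMX state before any later x87 FP code */
}
```

Compiling with -m32 -mno-sse2 is where this would pay off on the old CPUs mentioned above; on anything Skylake-era the XMM form is at least as good.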