https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863
--- Comment #3 from Gabriel Ravier <gabravier at gmail dot com> ---
For binary size, the `movsd` takes 4 bytes and the `blendps` takes 6 bytes.

The port allocations for the instructions are as follows (same formatting as for the throughputs):

Wolfdale: p5, p015
Nehalem: p5, p5
Westmere: p5, p5
Sandy Bridge: p05, p5
Ivy Bridge: p05, p5
Haswell: p015, p5
Broadwell: p015, p5
Skylake: p015, p5
Skylake-X: p015, p5
Kaby Lake: p015, p5
Coffee Lake: p015, p5
Cannon Lake: p015, p015
Ice Lake: p015, p015
Zen+: fp01, fp0123
Zen 2: fp013, fp0123

Something like "p015" means that the instruction can be executed on port 0, 1 or 5.

Also, both instructions take a single uop on all of these architectures.

The latency of both `blendps` and `movsd` is 1 cycle on every single architecture I could test.

Final note: the numbers are specifically for the `blendps xmm, xmm, imm8` and the `movsd xmm, xmm` forms of those instructions.
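
To make the 4-byte and 6-byte sizes quoted above concrete, here is a rough sketch in portable C. It writes out the encodings for one register pair (`xmm0`, `xmm1`, chosen only for illustration) and models both register-to-register forms on a four-lane struct. The `xmm_model` type and the `model_*` and helper function names are hypothetical and only serve this example; the immediate `0x3` is an assumption about which blend mask corresponds to `movsd xmm, xmm`, namely the one selecting the two low 32-bit lanes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Encodings for one register pair (xmm0 <- xmm1), picked for illustration.
   Other pairs among xmm0..xmm7 differ only in the ModRM byte (0xC1 here). */
static const uint8_t MOVSD_XMM0_XMM1[] = {
    0xF2, 0x0F, 0x10, 0xC1               /* movsd xmm0, xmm1 */
};
static const uint8_t BLENDPS_XMM0_XMM1[] = {
    0x66, 0x0F, 0x3A, 0x0C, 0xC1, 0x03   /* blendps xmm0, xmm1, 3 */
};

_Static_assert(sizeof MOVSD_XMM0_XMM1 == 4, "movsd xmm, xmm is 4 bytes");
_Static_assert(sizeof BLENDPS_XMM0_XMM1 == 6, "blendps xmm, xmm, imm8 is 6 bytes");

/* Extra bytes paid when blendps is used in place of movsd. */
int encoded_size_delta(void) {
    return (int)sizeof BLENDPS_XMM0_XMM1 - (int)sizeof MOVSD_XMM0_XMM1;
}

/* Hypothetical model of a 128-bit XMM register as four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } xmm_model;

/* movsd xmm, xmm: copy the low 64 bits of src into dst, keep the high 64. */
xmm_model model_movsd(xmm_model dst, xmm_model src) {
    dst.lane[0] = src.lane[0];
    dst.lane[1] = src.lane[1];
    return dst;
}

/* blendps xmm, xmm, imm8: lane i comes from src when bit i of imm8 is set. */
xmm_model model_blendps(xmm_model dst, xmm_model src, unsigned imm8) {
    for (int i = 0; i < 4; ++i) {
        if (imm8 & (1u << i)) {
            dst.lane[i] = src.lane[i];
        }
    }
    return dst;
}

/* Returns 1 when both models produce the same register contents. */
int models_agree(xmm_model a, xmm_model b) {
    xmm_model m = model_movsd(a, b);
    xmm_model p = model_blendps(a, b, 0x3);
    return memcmp(&m, &p, sizeof m) == 0;
}
```

Calling `models_agree(a, b)` on arbitrary inputs checks that the two modelled operations give the same result, and `encoded_size_delta()` gives the size cost of the `blendps` form for this register pair. This only models the data movement; it says nothing about ports, uops or latency, for which the figures above are the reference.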