https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863
Bug ID: 94863
Summary: Failure to use blendps over mov when possible
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: gabravier at gmail dot com
Target Milestone: ---

typedef double v2df __attribute__((vector_size(16)));

v2df move_sd(v2df a, v2df b)
{
    v2df result = a;
    result[0] = b[0];
    return result;
}

LLVM at -O3 compiles this as:

move_sd(double __vector(2), double __vector(2)):
        blendps xmm0, xmm1, 3           # xmm0 = xmm1[0,1],xmm0[2,3]
        ret

GCC gives this:

move_sd(double __vector(2), double __vector(2)):
        movsd   xmm0, xmm1
        ret

Using `blendps` here should be a worthwhile tradeoff. Here is a table of reciprocal throughputs for various CPU architectures, formatted as "arch-name: blendps-throughput, movsd-throughput":

Wolfdale:      1,    0.33
Nehalem:       1,    1
Westmere:      1,    1
Sandy Bridge:  0.5,  1
Ivy Bridge:    0.5,  1
Haswell:       0.33, 1
Broadwell:     0.33, 1
Skylake:       0.33, 1
Skylake-X:     0.33, 1
Kaby Lake:     0.33, 1
Coffee Lake:   0.33, 1
Cannon Lake:   0.33, 0.33
Ice Lake:      0.33, 0.33
Zen+:          0.5,  0.25
Zen 2:         0.33, 0.25

Unless there is an important factor other than throughput that could affect this, this should improve performance or keep it identical on every architecture except those where movsd has the higher throughput in the table above (Wolfdale, Zen+, Zen 2).