https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94870
Bug ID: 94870
Summary: Failure to use movhlps instead of separate mov+unpckhpd
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: gabravier at gmail dot com
Target Milestone: ---

typedef double v2df __attribute__((vector_size(16)));

v2df _mm_sqrt_sd(v2df a, v2df b)
{
    v2df c = __builtin_ia32_sqrtpd((v2df){b[0], b[1]});
    return (v2df){c[1], a[1]};
}

With -O3, LLVM outputs:

_mm_sqrt_sd(double __vector(2), double __vector(2)):
  sqrtpd xmm1, xmm1
  movhlps xmm0, xmm1 # xmm0 = xmm1[1],xmm0[1]
  ret

GCC outputs:

_mm_sqrt_sd(double __vector(2), double __vector(2)):
  movapd xmm2, xmm0
  sqrtpd xmm0, xmm1
  unpckhpd xmm0, xmm2
  ret

unpckhpd and movhlps seem to have equivalent performance. movhlps overwrites only the low half of xmm0 and leaves a[1] in place in the high half, so using it to elide the extra movapd seems like it would make sense.
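
To illustrate why the two sequences are interchangeable, here is a minimal sketch in portable C that models the lane behaviour of both instructions. The names vec2d, model_unpckhpd, model_movhlps, gcc_sequence, llvm_sequence and sequences_agree are hypothetical and exist only for this illustration. The sqrtpd result is passed in as c rather than computed, and nothing here reflects how GCC or LLVM actually select instructions.

```c
#include <assert.h>

/* Hypothetical model of an XMM register holding two doubles. */
typedef struct { double lane[2]; } vec2d;

/* unpckhpd dst, src: dst = { dst[1], src[1] } */
static vec2d model_unpckhpd(vec2d dst, vec2d src)
{
    vec2d r = { { dst.lane[1], src.lane[1] } };
    return r;
}

/* movhlps dst, src: low half of dst = high half of src,
   high half of dst is left unchanged. */
static vec2d model_movhlps(vec2d dst, vec2d src)
{
    vec2d r = { { src.lane[1], dst.lane[1] } };
    return r;
}

/* The GCC sequence from the report; a starts in xmm0. */
static vec2d gcc_sequence(vec2d a, vec2d c)
{
    vec2d xmm2 = a;                     /* movapd   xmm2, xmm0 */
    vec2d xmm0 = c;                     /* sqrtpd   xmm0, xmm1 (result modelled as c) */
    return model_unpckhpd(xmm0, xmm2);  /* unpckhpd xmm0, xmm2 */
}

/* The LLVM sequence from the report; a stays in xmm0 throughout. */
static vec2d llvm_sequence(vec2d a, vec2d c)
{
    vec2d xmm1 = c;                     /* sqrtpd  xmm1, xmm1 (result modelled as c) */
    return model_movhlps(a, xmm1);      /* movhlps xmm0, xmm1 */
}

/* Returns 1 when both sequences yield { c[1], a[1] }. */
static int sequences_agree(vec2d a, vec2d c)
{
    vec2d g = gcc_sequence(a, c);
    vec2d l = llvm_sequence(a, c);
    assert(g.lane[0] == c.lane[1] && g.lane[1] == a.lane[1]);
    return g.lane[0] == l.lane[0] && g.lane[1] == l.lane[1];
}
```

In this model both sequences produce { c[1], a[1] }. The difference is that unpckhpd needs c in the destination register, which forces a to be copied out of xmm0 first, while movhlps keeps a in xmm0 and only replaces its low half.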