https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979
--- Comment #7 from vincenzo Innocente <vincenzo.innocente at cern dot ch> ---
Handcrafted x86 code would look like this.

Scalar code:

        vmovss          xmm0, dword ptr [rdi]
        vmovss          xmm1, dword ptr [rdi + 4]
        vmovss          xmm2, dword ptr [rsi]
        vmovss          xmm3, dword ptr [rsi + 4]
        vmulss          xmm4, xmm3, xmm1
        vmulss          xmm1, xmm2, xmm1
        vfmsub231ss     xmm4, xmm0, xmm2
        vfmadd231ss     xmm1, xmm3, xmm0
        vinsertps       xmm0, xmm4, xmm1, 16
        ret

and vector code:

        vmovsd          xmm3, QWORD PTR [rdi]
        vmovshdup       xmm1, QWORD PTR [rsi]
        vmovsldup       xmm0, QWORD PTR [rsi]
        vshufps         xmm2, xmm3, xmm3, 177
        vmulps          xmm4, xmm1, xmm2
        vfmaddsub213ps  xmm0, xmm3, xmm4
        ret

scalar: 4 loads, 2 multiplies, 2 FMA
vector: 3 loads, 1 shuffle, 1 multiply, 1 FMA

Note that the vmovshdup and vmovsldup instructions use only the load ports,
so the vector code should be even faster with the use of FMA.
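
To make the data flow of the vector sequence easier to check, here is a minimal lane-by-lane model in plain C. It is only a sketch: the type `cfloat` and the functions `cmul_ref`, `cmul_lanes` and `cmul_self_check` are hypothetical names introduced for illustration, only the two low lanes are modelled, and ordinary multiplies and adds stand in for the fused operations, so rounding may differ from the real FMA instructions for inputs that are not exactly representable.

```c
#include <assert.h>

/* Hypothetical type for this sketch: one single-precision complex number,
   laid out as {re, im} like the operands at [rdi] and [rsi]. */
typedef struct { float re, im; } cfloat;

/* Reference: the textbook complex product. */
cfloat cmul_ref(cfloat a, cfloat b) {
    cfloat r = { a.re * b.re - a.im * b.im,
                 a.re * b.im + a.im * b.re };
    return r;
}

/* Lane-by-lane model of the vector sequence (lanes 0 and 1 only). */
cfloat cmul_lanes(cfloat a, cfloat b) {
    float x3[2] = { a.re, a.im };    /* vmovsd: load a */
    float x1[2] = { b.im, b.im };    /* vmovshdup: duplicate the odd lane of b */
    float x0[2] = { b.re, b.re };    /* vmovsldup: duplicate the even lane of b */
    float x2[2] = { x3[1], x3[0] };  /* vshufps 177: swap the lanes of each pair */
    float x4[2] = { x1[0] * x2[0],   /* vmulps: b.im * a.im */
                    x1[1] * x2[1] }; /*         b.im * a.re */
    /* vfmaddsub213ps: x0 = x3 * x0 -/+ x4,
       subtracting in the even lane and adding in the odd lane. */
    cfloat r = { x3[0] * x0[0] - x4[0],
                 x3[1] * x0[1] + x4[1] };
    return r;
}

/* (1 + 2i) * (3 + 4i) = -5 + 10i, exact in single precision. */
void cmul_self_check(void) {
    cfloat a = { 1.0f, 2.0f }, b = { 3.0f, 4.0f };
    cfloat r = cmul_lanes(a, b);
    assert(r.re == -5.0f && r.im == 10.0f);
}
```

Calling `cmul_self_check()` from a test program exercises the model on one input; comparing `cmul_lanes` against `cmul_ref` on further inputs is a quick way to convince oneself that the shuffle constant and the add/subtract lane order are right.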