https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979

--- Comment #7 from vincenzo Innocente <vincenzo.innocente at cern dot ch> ---
Handcrafted x86 code would look like this.
Scalar code:

       vmovss  xmm0, dword ptr [rdi]
       vmovss  xmm1, dword ptr [rdi + 4]
       vmovss  xmm2, dword ptr [rsi]
       vmovss  xmm3, dword ptr [rsi + 4]
       vmulss  xmm4, xmm3, xmm1
       vmulss  xmm1, xmm2, xmm1
       vfmsub231ss     xmm4, xmm0, xmm2
       vfmadd231ss     xmm1, xmm3, xmm0
       vinsertps       xmm0, xmm4, xmm1, 16
       ret

and vector code:

       vmovsd    xmm3, QWORD PTR [rdi]
       vmovshdup xmm1, QWORD PTR [rsi]               
       vmovsldup xmm0, QWORD PTR [rsi]
       vshufps   xmm2, xmm3, xmm3, 177      
       vmulps    xmm4, xmm1, xmm2  
       vfmaddsub213ps xmm0, xmm3, xmm4
       ret

scalar:  4 loads, 2 multiplies, 2 FMA, 1 insert
vector:  3 loads, 1 shuffle, 1 multiply, 1 FMA

Note that the vmovshdup and vmovsldup instructions use only the load
ports, so the vector code should be even faster with the use of FMA.
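
As a sanity check on the vector sequence above, here is a minimal sketch in
plain C that models each instruction lane by lane and compares the result
with the scalar formula. The type cfloat and the function names are
hypothetical and only for illustration. Only the two low lanes are modelled,
and the products and sums are written as separate operations, so the single
rounding of a real FMA is not reproduced.

```c
#include <assert.h>

/* Hypothetical layout: one single-precision complex number, {re, im}. */
typedef struct { float re, im; } cfloat;

/* Mirrors the scalar listing: (a.re*b.re - a.im*b.im, b.im*a.re + b.re*a.im). */
cfloat cmul_scalar(cfloat a, cfloat b)
{
    cfloat r;
    r.re = a.re * b.re - b.im * a.im;   /* vmulss + vfmsub231ss */
    r.im = b.im * a.re + b.re * a.im;   /* vmulss + vfmadd231ss */
    return r;
}

/* Lane-by-lane model of the vector listing (low two lanes only). */
cfloat cmul_vector_model(cfloat a, cfloat b)
{
    float x3[2] = { a.re, a.im };                        /* vmovsd: load a            */
    float x1[2] = { b.im, b.im };                        /* vmovshdup: odd lane of b  */
    float x0[2] = { b.re, b.re };                        /* vmovsldup: even lane of b */
    float x2[2] = { x3[1], x3[0] };                      /* vshufps 177: swap re/im   */
    float x4[2] = { x1[0] * x2[0], x1[1] * x2[1] };      /* vmulps                    */
    cfloat r;
    r.re = x0[0] * x3[0] - x4[0];   /* vfmaddsub213ps: even lane subtracts */
    r.im = x0[1] * x3[1] + x4[1];   /* vfmaddsub213ps: odd lane adds       */
    return r;
}

/* (1 + 2i) * (3 + 4i) = -5 + 10i; all values are exact in float. */
void cmul_selfcheck(void)
{
    cfloat a = { 1.0f, 2.0f }, b = { 3.0f, 4.0f };
    cfloat s = cmul_scalar(a, b);
    cfloat v = cmul_vector_model(a, b);
    assert(s.re == -5.0f && s.im == 10.0f);
    assert(v.re == s.re && v.im == s.im);
}
```

Calling cmul_selfcheck() exercises both paths on one input with exactly
representable values; it says nothing about performance, only that the
shuffle and the add/sub lane pattern produce the complex product.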
