[Bug target/116979] [12/13/14/15 regression] fma not always used in complex product

vincenzo.innocente at cern dot ch via Gcc-bugs Tue, 08 Oct 2024 01:42:58 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979


--- Comment #7 from vincenzo Innocente <vincenzo.innocente at cern dot ch> ---
Handcrafted x86 code would look like this.
Scalar code:

       vmovss  xmm0, dword ptr [rdi]
       vmovss  xmm1, dword ptr [rdi + 4]
       vmovss  xmm2, dword ptr [rsi]
       vmovss  xmm3, dword ptr [rsi + 4]
       vmulss  xmm4, xmm3, xmm1
       vmulss  xmm1, xmm2, xmm1
       vfmsub231ss     xmm4, xmm0, xmm2
       vfmadd231ss     xmm1, xmm3, xmm0
       vinsertps       xmm0, xmm4, xmm1, 16
       ret
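
For readers who prefer source to assembly, the scalar sequence above can be sketched in C++ roughly as follows. This is only an illustration: the function name cmul_fma_scalar is hypothetical, and the mapping of instructions to statements assumes that [rdi] and [rsi] each point to a (real, imag) pair of floats.

```cpp
#include <cmath>
#include <complex>

// Hypothetical helper (not from the bug report): mirrors the scalar
// sequence, i.e. two plain multiplies followed by two fused multiply-adds.
inline std::complex<float> cmul_fma_scalar(std::complex<float> a,
                                           std::complex<float> b) {
    const float ar = a.real(), ai = a.imag();  // assumed layout of [rdi]
    const float br = b.real(), bi = b.imag();  // assumed layout of [rsi]
    const float t_re = bi * ai;                // vmulss xmm4, xmm3, xmm1
    const float t_im = br * ai;                // vmulss xmm1, xmm2, xmm1
    const float re = std::fma(ar, br, -t_re);  // vfmsub231ss: ar*br - bi*ai
    const float im = std::fma(bi, ar, t_im);   // vfmadd231ss: bi*ar + br*ai
    return {re, im};                           // vinsertps packs (re, im)
}
```

Whether a compiler turns std::fma into an FMA instruction depends on the target flags in use.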

and vector code:

       vmovsd    xmm3, QWORD PTR [rdi]
       vmovshdup xmm1, QWORD PTR [rsi]               
       vmovsldup xmm0, QWORD PTR [rsi]
       vshufps   xmm2, xmm3, xmm3, 177      
       vmulps    xmm4, xmm1, xmm2  
       vfmaddsub213ps xmm0, xmm3, xmm4
       ret
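
The data flow of the vector sequence above can be modelled lane by lane in portable C++. This is a sketch under assumptions, not the intrinsics a compiler would emit: the Lanes type and the function name cmul_fma_vector_model are hypothetical, and only the low two float lanes (one complex number) are modelled.

```cpp
#include <array>
#include <cmath>

// Hypothetical model type: the low two float lanes of an xmm register.
using Lanes = std::array<float, 2>;

// Hypothetical lane-by-lane model of the vector sequence, with
// a = (ar, ai) assumed to come from [rdi] and b = (br, bi) from [rsi].
inline Lanes cmul_fma_vector_model(const Lanes& a, const Lanes& b) {
    const Lanes hi = {b[1], b[1]};    // vmovshdup: duplicate odd lane  (bi, bi)
    const Lanes lo = {b[0], b[0]};    // vmovsldup: duplicate even lane (br, br)
    const Lanes sw = {a[1], a[0]};    // vshufps imm 177: swap the pair (ai, ar)
    const Lanes t  = {hi[0] * sw[0],  // vmulps: (bi*ai, bi*ar)
                      hi[1] * sw[1]};
    // vfmaddsub213ps: the even lane subtracts, the odd lane adds.
    return {std::fma(lo[0], a[0], -t[0]),   // br*ar - bi*ai
            std::fma(lo[1], a[1],  t[1])};  // br*ai + bi*ar
}
```

The model makes the single multiply and single fused operation visible: each is applied to both lanes at once in the real instruction stream.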

scalar:  4 loads, 2 multiplies, 2 FMA
vector:  3 loads, 1 shuffle, 1 multiply, 1 FMA

Note that the hardware instructions vmovshdup and vmovsldup use only the load
ports, so the vector code should be even faster when FMA is used.

[Bug target/116979] [12/13/14/15 regression] fma not always used in complex product
