[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

already5chosen at yahoo dot com via Gcc-bugs Thu, 24 Sep 2020 01:28:29 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127


--- Comment #10 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Hongtao.liu from comment #9)
> (In reply to Michael_S from comment #8)
> > What are values of gcc "loop" cost of the relevant instructions now?
> > 1. AVX256 Load
> > 2. FMA3 ymm,ymm,ymm
> > 3. AVX256 Regmove
> > 4. FMA3 mem,ymm,ymm
> 
> For skylake, outside of register allocation.
> 
> they are
> 1. AVX256 Load  ---- 10
> 2. FMA3 ymm,ymm,ymm --- 16
> 3. AVX256 Regmove  --- 2
> 4. FMA3 mem,ymm,ymm --- 32
> 
> In RA, there is no direct cost for FMA instructions, but we can disparage the
> memory alternative in FMA instructions, but again, it may hurt performance in
> some cases.
> 
> 1. AVX256 Load  ---- 10
> 3. AVX256 Regmove  --- 2
> 
> BTW: we have done a lot of experiments with different cost models and saw no
> significant performance impact on SPEC2017.

Thank you.
With relative costs like these, gcc should generate 'FMA3 mem,ymm,ymm' only
under heavy register pressure. So why does it generate it in my loop,
where register pressure in the innermost loop is light, and even at the next
outer level the pressure isn't heavy?
What am I missing?
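
To make the arithmetic behind that question concrete, here is a minimal sketch in C that plugs in the four cost values quoted above. It assumes the costs simply add up per instruction, which is only an illustration and not how gcc's cost model or register allocator actually combines them; the names COST_* and cost_split/cost_folded are hypothetical and do not come from gcc.

```c
/* Cost units as quoted above for Skylake, outside of register allocation. */
enum {
    COST_AVX256_LOAD    = 10, /* 1. AVX256 Load         */
    COST_FMA3_REG       = 16, /* 2. FMA3 ymm,ymm,ymm    */
    COST_AVX256_REGMOVE = 2,  /* 3. AVX256 Regmove      */
    COST_FMA3_MEM       = 32  /* 4. FMA3 mem,ymm,ymm    */
};

/* Separate load followed by a register-only FMA.
   Assumption: the two costs are simply summed. */
static int cost_split(void)
{
    return COST_AVX256_LOAD + COST_FMA3_REG;
}

/* Single FMA with the load folded into a memory operand. */
static int cost_folded(void)
{
    return COST_FMA3_MEM;
}

/* Extra cost of the folded form over the split form. */
static int folded_penalty(void)
{
    return cost_folded() - cost_split();
}
```

Under this additive assumption the split form costs 26 and the folded form 32, so the folded form should only win when a free register for the loaded value is not available.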
