> I think to make a difference you need to hit the number of parallel fadd/fmul
> the pipeline can perform.  I don't think issue width is ever a problem for
> chains w/o fma and throughput of fma vs fadd + fmul should be similar.
> 

Yes, for the x86 backend, fadd, fmul and fma have the same throughput (TP), 
so they should have the same width. 
The current implementation is reasonable  /* reassoc int, fp, vec_int, vec_fp.  */.

> That said, I think iff then we should try to improve
> rewrite_expr_tree_parallel rather than adding a new function.  For example
> for the case with equal rank operands we can try to sort adds first.  I can't
> convince myself that rewrite_expr_tree_parallel honors ranks properly
> quickly.
> 

I rewrote this patch; there are two main changes:
1. I made some changes to rewrite_expr_tree_parallel_for_fma and used it 
instead of rewrite_expr_tree_parallel. The following example shows that the 
sequence generated by this patch is better.
2. Put non-mult ops and mult ops alternately at the end of the queue, which 
helps generate more FMAs and reduces the loss of FMAs when breaking the 
chain.
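
A minimal sketch of one reading of change 2, for illustration only and not the code in the patch. It assumes the operand queue is a flat list, that a hypothetical predicate `is_mult` says whether an operand is the result of a multiplication, and that any surplus operands of the more numerous kind stay at the front of the queue.

```python
def interleave_at_tail(ops, is_mult):
    """Reorder ops so the tail alternates non-mult, mult, non-mult, mult, ...

    ops     -- operand names in their current queue order
    is_mult -- hypothetical predicate: True if the operand comes from a multiply
    """
    mult = [op for op in ops if is_mult(op)]
    other = [op for op in ops if not is_mult(op)]
    n = min(len(mult), len(other))  # number of (non-mult, mult) pairs we can form

    # Surplus operands that cannot be paired stay at the front.
    head = mult[:len(mult) - n] + other[:len(other) - n]

    # The last n operands of each kind are paired up at the end of the queue.
    tail = []
    for o, m in zip(other[len(other) - n:], mult[len(mult) - n:]):
        tail += [o, m]
    return head + tail


# Toy operands: a string containing "*" stands for a multiplication result.
example = interleave_at_tail(["x*y", "u*v", "a", "b", "c"], lambda op: "*" in op)
print(example)
```

In this toy run the two multiplication results end up next to the plain operands b and c at the end of the queue, so when the chain is split, each piece taken from the tail still contains an add with a multiplication operand that can become an FMA.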
  
With these two changes, GCC can break the chain with width = 2 and generate 6 
FMAs for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350 without any params.

------------------------------------------------------------------------------------------------------------------
Source code: g + h + j + s + m + n + a + b + e  (https://godbolt.org/z/G8sb86n84)
Compile options: -Ofast -mfpmath=sse -mfma
Width = 3 was chosen for reassociation
-----------------------------------------------------------------------------------------------------------------
Old rewrite_expr_tree_parallel generates:
  _6 = g_8(D) + h_9(D);    ------> parallel 0
  _3 = s_11(D) + m_12(D);  ------> parallel 1
  _5 = _3 + j_10(D);
  _2 = n_13(D) + a_14(D);  ------> parallel 2
  _1 = b_15(D) + e_16(D);  ------> parallel 3; this is not necessary and is not friendly to FMA
  _4 = _1 + _2;
  _7 = _4 + _5;
  _17 = _6 + _7;
  return _17;

With width = 3, we need 5 cycles here.
--------------------------------------------- first end ---------------------------------------------
The rewritten rewrite_expr_tree_parallel (3 sets in parallel) generates:

  _3 = s_11(D) + m_12(D);  ------> parallel 0
  _5 = _3 + j_10(D);
  _2 = n_13(D) + a_14(D);   ------> parallel 1
  _1 = b_15(D) + e_16(D);   ------> parallel 2
  _4 = _1 + _2;
  _6 = _4 + _5;
  _7 = _6 + h_9(D);
  _17 = _7 + g_8(D); 
  return _17;

With width = 3, we need 5 cycles here.
--------------------------------------------- second end --------------------------------------------
Using rewrite_expr_tree_parallel_for_fma instead of rewrite_expr_tree_parallel 
generates:

  _3 = s_11(D) + m_12(D);
  _6 = _3 + g_8(D);
  _2 = n_13(D) + a_14(D);
  _5 = _2 + h_9(D);
  _1 = b_15(D) + e_16(D);
  _4 = _1 + j_10(D);
  _7 = _4 + _5;
  _17 = _7 + _6;
  return _17;

With width = 3, we need 4 cycles here.
--------------------------------------------- third end ---------------------------------------------
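
The cycle counts for the three sequences above can be reproduced with a small model. This is a minimal sketch under stated assumptions, not how GCC or the hardware actually schedules: every addition takes one cycle, at most `width` statements issue per cycle, and in each cycle statements are picked in program order among those whose operands are already computed. The tuple encoding of the GIMPLE statements (SSA version suffixes dropped) is only for illustration.

```python
def cycles(stmts, width=3):
    """Count cycles for stmts given as (result, operand1, operand2) in program order.

    Assumed model: unit latency, at most `width` statements per cycle,
    ready statements issued in program order.
    """
    defined = {res for res, _, _ in stmts}  # names computed inside the sequence
    done = set()                            # results available so far
    pending = list(stmts)
    n = 0
    while pending:
        issued = []
        for stmt in pending:
            _, a, b = stmt
            # An operand is ready if it is an input or was computed in an earlier cycle.
            if all(op not in defined or op in done for op in (a, b)):
                issued.append(stmt)
                if len(issued) == width:
                    break
        if not issued:
            raise ValueError("cyclic or unsatisfiable dependencies")
        for stmt in issued:
            pending.remove(stmt)
        done.update(res for res, _, _ in issued)
        n += 1
    return n


OLD = [("_6", "g", "h"), ("_3", "s", "m"), ("_5", "_3", "j"), ("_2", "n", "a"),
       ("_1", "b", "e"), ("_4", "_1", "_2"), ("_7", "_4", "_5"), ("_17", "_6", "_7")]
REWRITTEN = [("_3", "s", "m"), ("_5", "_3", "j"), ("_2", "n", "a"), ("_1", "b", "e"),
             ("_4", "_1", "_2"), ("_6", "_4", "_5"), ("_7", "_6", "h"), ("_17", "_7", "g")]
FOR_FMA = [("_3", "s", "m"), ("_6", "_3", "g"), ("_2", "n", "a"), ("_5", "_2", "h"),
           ("_1", "b", "e"), ("_4", "_1", "j"), ("_7", "_4", "_5"), ("_17", "_7", "_6")]

print(cycles(OLD), cycles(REWRITTEN), cycles(FOR_FMA))  # prints "5 5 4"
```

Under this model the third sequence is shorter because each of the three parallel chains absorbs one extra operand in the second cycle, leaving only two combining adds at the end.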

Thanks,
Lili.
