On Wed, May 17, 2023 at 3:05 PM Cui, Lili <lili....@intel.com> wrote: > > > I think to make a difference you need to hit the number of parallel > > fadd/fmul > > the pipeline can perform. I don't think issue width is ever a problem for > > chains w/o fma and throughput of fma vs fadd + fmul should be similar. > > > > Yes, for x86 backend, fadd , fmul and fma have the same TP meaning they > should have the same width. > The current implementation is reasonable /* reassoc int, fp, vec_int, > vec_fp. */. > > > That said, I think iff then we should try to improve > > rewrite_expr_tree_parallel rather than adding a new function. For example > > for the case with equal rank operands we can try to sort adds first. I > > can't > > convince myself that rewrite_expr_tree_parallel honors ranks properly > > quickly. > > > > I rewrite this patch, there are mainly two changes: > 1. I made some changes to rewrite_expr_tree_parallel_for_fma and used it > instead of rewrite_expr_tree_parallel. The following example shows that the > sequence generated by the this patch is better. > 2. Put no-mult ops and mult ops alternately at the end of the queue, which is > conducive to generating more fma and reducing the loss of FMA when breaking > the chain. > > With these two changes, GCC can break the chain with width = 2 and generates > 6 FMAs for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350 without any > params. > > ------------------------------------------------------------------------------------------------------------------ > Source code: g + h + j + s + m + n+a+b +e (https://godbolt.org/z/G8sb86n84) > Compile options: -Ofast -mfpmath=sse -mfma > Width = 3 was chosen for reassociation > ----------------------------------------------------------------------------------------------------------------- > Old rewrite_expr_tree_parallel generates: > _6 = g_8(D) + h_9(D); ------> parallel 0 > _3 = s_11(D) + m_12(D); ------> parallel 1 > _5 = _3 + j_10(D); > _2 = n_13(D) + a_14(D); ------> parallel 2 > _1 = b_15(D) + e_16(D); -----> Parallel 3, This is not necessary, and it > is not friendly to FMA. > _4 = _1 + _2; > _7 = _4 + _5; > _17 = _6 + _7; > return _17; > > When the width = 3, we need 5 cycles here. > ---------------------------------------------first > end--------------------------------------------------------- > Rewrite the old rewrite_expr_tree_parallel (3 sets in parallel) generates: > > _3 = s_11(D) + m_12(D); ------> parallel 0 > _5 = _3 + j_10(D); > _2 = n_13(D) + a_14(D); ------> parallel 1 > _1 = b_15(D) + e_16(D); ------> parallel 2 > _4 = _1 + _2; > _6 = _4 + _5; > _7 = _6 + h_9(D); > _17 = _7 + g_8(D); > return _17; > > When the width = 3, we need 5 cycles here. > ---------------------------------------------second > end------------------------------------------------------- > Use rewrite_expr_tree_parallel_for_fma instead of rewrite_expr_tree_parallel > generates: > > _3 = s_11(D) + m_12(D); > _6 = _3 + g_8(D); > _2 = n_13(D) + a_14(D); > _5 = _2 + h_9(D); > _1 = b_15(D) + e_16(D); > _4 = _1 + j_10(D); > _7 = _4 + _5; > _17 = _7 + _6; > return _17; > > When the width = 3, we need 4 cycles here. > --------------------------------------------third > end-----------------------------------------------------------
Yes, so what I was saying is that I doubt rewrite_expr_tree_parallel is optimal - you show that for the specific example rewrite_expr_tree_parallel_for_fma is better. I was arguing we want a single function, whether we single out leaves with multiplications or not. And we want documentation that shows the strategy will result in optimal latency (I think we should not sacrifice latency just for the sake of forming more FMAs). Richard. > > Thanks, > Lili. >