> I think to make a difference you need to hit the number of parallel fadd/fmul > the pipeline can perform. I don't think issue width is ever a problem for > chains w/o fma and throughput of fma vs fadd + fmul should be similar. >
Yes, for x86 backend, fadd , fmul and fma have the same TP meaning they should have the same width. The current implementation is reasonable /* reassoc int, fp, vec_int, vec_fp. */. > That said, I think iff then we should try to improve > rewrite_expr_tree_parallel rather than adding a new function. For example > for the case with equal rank operands we can try to sort adds first. I can't > convince myself that rewrite_expr_tree_parallel honors ranks properly > quickly. > I rewrite this patch, there are mainly two changes: 1. I made some changes to rewrite_expr_tree_parallel_for_fma and used it instead of rewrite_expr_tree_parallel. The following example shows that the sequence generated by the this patch is better. 2. Put no-mult ops and mult ops alternately at the end of the queue, which is conducive to generating more fma and reducing the loss of FMA when breaking the chain. With these two changes, GCC can break the chain with width = 2 and generates 6 FMAs for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350 without any params. ------------------------------------------------------------------------------------------------------------------ Source codeļ¼ g + h + j + s + m + n+a+b +e (https://godbolt.org/z/G8sb86n84) Compile options: -Ofast -mfpmath=sse -mfma Width = 3 was chosen for reassociation ----------------------------------------------------------------------------------------------------------------- Old rewrite_expr_tree_parallel generates: _6 = g_8(D) + h_9(D); ------> parallel 0 _3 = s_11(D) + m_12(D); ------> parallel 1 _5 = _3 + j_10(D); _2 = n_13(D) + a_14(D); ------> parallel 2 _1 = b_15(D) + e_16(D); -----> Parallel 3, This is not necessary, and it is not friendly to FMA. _4 = _1 + _2; _7 = _4 + _5; _17 = _6 + _7; return _17; When the width = 3, we need 5 cycles here. ---------------------------------------------first end--------------------------------------------------------- Rewrite the old rewrite_expr_tree_parallel (3 sets in parallel) generates: _3 = s_11(D) + m_12(D); ------> parallel 0 _5 = _3 + j_10(D); _2 = n_13(D) + a_14(D); ------> parallel 1 _1 = b_15(D) + e_16(D); ------> parallel 2 _4 = _1 + _2; _6 = _4 + _5; _7 = _6 + h_9(D); _17 = _7 + g_8(D); return _17; When the width = 3, we need 5 cycles here. ---------------------------------------------second end------------------------------------------------------- Use rewrite_expr_tree_parallel_for_fma instead of rewrite_expr_tree_parallel generates: _3 = s_11(D) + m_12(D); _6 = _3 + g_8(D); _2 = n_13(D) + a_14(D); _5 = _2 + h_9(D); _1 = b_15(D) + e_16(D); _4 = _1 + j_10(D); _7 = _4 + _5; _17 = _7 + _6; return _17; When the width = 3, we need 4 cycles here. --------------------------------------------third end----------------------------------------------------------- Thanks, Lili.