On Tue, Jul 11, 2023 at 5:00 AM Di Zhao OS via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > Attached is an updated version of the patch. > > Based on Philipp's review, some changes: > > 1. Defined new enum fma_state to describe the state of FMA candidates > for a list of operands. (Since the tests seems simple after the > change, I didn't add predicates on it.) > 2. Changed return type of convert_mult_to_fma_1 and convert_mult_to_fma > to tree, to remove the in/out parameter. > 3. Added description of return value values of rank_ops_for_fma.
I'll note that rank_ops_for_fma works on the single-use addition chain only so there might be FMA ops "elsewhere" in the dependence chain that are not in 'ops'. You are using defer_p = false for the fma_deferring_state so I wonder why you need this machinery at all? Wouldn't the restriction only apply when we'd actually deferred a FMA generation? I'm CCing Martin who did this work. But what I'm curious about is that if any of the new conditions trigger then you leave the ops chain alone - but it _could_ already be in "bad" shape, is there anything we could do to improve ordering? Also if (ops_mult.length () >= 2 && ops_mult.length () != ops_length) { + if (maybe_le (tree_to_poly_int64 (TYPE_SIZE (type)), + param_avoid_fma_max_bits)) + /* Avoid re-arrange to produce less FMA chains that can be slow. */ + return FMA_STATE_MULTIPLE; + /* Put no-mult ops and mult ops alternately at the end of the queue, which is conducive to generating more FMA and reducing the loss of FMA when breaking the chain. */ @@ -6829,9 +6909,9 @@ rank_ops_for_fma (vec<operand_entry *> *ops) if (opindex > 0) opindex--; } - return true; + return FMA_STATE_MULTIPLE; so we end up parallel rewriting in any case and just avoid "optimizing" the ops list here. >From the PR it looks like without the rewriting we are lucky that the FMA generation heuristic to avoid cross backedge FMA dependences doesn't trigger without associating (but parallel rewrite is still good)? For the NESTED case we avoid parallel rewriting completely, independent on param_avoid_fma_max_bits - it seems from the PR we want more FMAs here and the parallel rewriting makes it worse? But I don't see exactly how. I think these are all a bit fragile heuristics also give that reassoc works on single-use chains only. The more we interwind FMA creation in widen_mult and associating for FMA in reassoc the more I think that reassoc itself should form the FMAs? The passes are unfortunately quite a bit separated. Can you produce testcases for the two problematical cases in the PR? That said, the added heuristics are not looking universally good to me without better motivation. > --- > gcc/ChangeLog: > Missing PR tree-optimization/110279 so bugzilla picks up the commit. > * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added new parameter > check_only_p. Changed return type to tree. > (struct fma_transformation_info): Moved to header. > (class fma_deferring_state): Moved to header. > (convert_mult_to_fma): Added new parameter check_only_p. Changed > return type to tree. > * tree-ssa-math-opts.h (struct fma_transformation_info): Moved from > .cc. > (class fma_deferring_state): Moved from .cc. > (convert_mult_to_fma): Add function decl. > * tree-ssa-reassoc.cc (enum fma_state): Defined new enum to describe > the state of FMA candidates for a list of operands. > (rewrite_expr_tree_parallel): Changed boolean parameter to enum type. > (rank_ops_for_fma): Return enum fma_state. > (reassociate_bb): Avoid rewriting to parallel if nested FMAs are > found. > > Thanks, > Di Zhao > >