On Wed, Aug 9, 2023 at 6:53 PM Di Zhao OS <diz...@os.amperecomputing.com> wrote: > > Hi, > > The previous version of this patch tries to solve two problems > at the same time. For better clarity, I'll separate them and > only deal with the "nested" FMA in this version. I plan to > propose another patch in avoiding bad shaped FMA (deferring FMA). > > Other changes: > > 1. Added new testcases for the "nested" FMA issue. For the > following code: > > tmp1 = a + c * c + d * d + x * y; > tmp2 = x * tmp1; > result += (a + c + d + tmp2); > > , when "tmp1 = ..." is not rewritten, tmp1 will be result of > an FMA, and there will be a list of consecutive FMAs: > > _1 = .FMA (c, c, a_39); > _2 = .FMA (d, d, _1); > tmp1 = .FMA (x, y, _2); > _3 = .FMA (tmp1, x, d); > ... > > If "tmp1 = ..." is rewritten to parallel, tmp1 will be result > of a PLUS_EXPR between FMAs: > > _1 = .FMA (c, c, a_39); > _2 = x * y; > _3 = .FMA (d, d, _2); > tmp1 = _3 + _1; > _4 = .FMA (tmp1, x, d); > ... > > It seems the register pressure of the latter is higher than > the former.
Yes, that's a natural consequence of rewriting to parallel. > On the test machines we have (including Ampere1, > Neoverse-n1 and Intel Xeon), with "tmp1 = ..." is rewritten to > parallel, the run time all increased significantly. In > contrast, when "tmp1" is not the 1st or 2nd operand of another > FMA (pr110279-1.c), rewriting it results in better performance. > (I'll also append the testcases in the bug tracker.) > > 2. Enhanced checking for nested FMA by: 1) Modified > convert_mult_to_fma so it can return multiple LHS. 2) Check > NEGATE_EXPRs for nested FMA. > > (I think maybe this can be further refined by enabling rewriting > to parallel for very long op list. ) So again, what you do applies to all operations, not just FMA. Consider tmp1 = a + c + d + y; tmp2 = x + tmp1; result += (a + c + d + tmp2); foo (tmp2); where I just removed all multiplications. Since re-assoc works on isolated single-use chains it will rewrite the tmp2 chain to parallel and it will rewrite the result chain to parallel, in the end this results in reassoc-width for 'result' to not be honored because we don't see that at 'tmp2' it will fork again. OTOH the other 'result' arms end, so eventually just two (for reassoc width 2) arms are "active" at any point. That said - isn't the issue that we "overcommit" reassoc-width this way because we apply it locally instead of globally (of course also ignoring every other chain of instructions reassoc isn't interestedin)? Unfortunately we work backwards when processing chains, if we processed leaf chains first we could record the association width applied to the chain at its new root and honor that when such root ends up in the oplist of a consuming chain. But as we work backwards we'd have to record the reassoc width used to in the leafs of the associated chain. So if those become roots of other chains we can then still honor that. Would it work to attack the problem this way? For parallel rewritten chains record the width used? Similar to operand_rank we could use a hash-map from SSA leaf to width it appears in. When we rewrite a chain with such leaf as root we can then subtract the incoming chain width from reassoc-width to lower the width its tail? Richard. > Bootstrapped and regression tested on x86_64-linux-gnu. > > Thanks, > Di Zhao > > ---- > > PR tree-optimization/110279 > > gcc/ChangeLog: > > * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added > new parameter collect_lhs. > (struct fma_transformation_info): Moved to header. > (class fma_deferring_state): Moved to header. > (convert_mult_to_fma): Added new parameter collect_lhs. > * tree-ssa-math-opts.h (struct fma_transformation_info): > (class fma_deferring_state): Moved from .cc. > (convert_mult_to_fma): Moved from .cc. > * tree-ssa-reassoc.cc (enum fma_state): Defined enum to > describe the state of FMA candidates for a list of > operands. > (rewrite_expr_tree_parallel): Changed boolean parameter > to enum type. > (has_nested_fma_p): New function to check for nested FMA > on given multiplication statement. > (rank_ops_for_fma): Return enum fma_state. > (reassociate_bb): Avoid rewriting to parallel if nested > FMAs are found. > > gcc/testsuite/ChangeLog: > > * gcc.dg/pr110279-1.c: New test. > * gcc.dg/pr110279-2.c: New test.