Hi, A few updates to the patch:
1. rank_ops_for_fma: return FMA_STATE_NESTED only for complete FMA chain, since the regression is obvious only in this case. 2. Added new testcase. Thanks, Di Zhao ---- PR tree-optimization/110279 gcc/ChangeLog: * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added new parameter collect_lhs. (struct fma_transformation_info): Moved to header. (class fma_deferring_state): Moved to header. (convert_mult_to_fma): Added new parameter collect_lhs. * tree-ssa-math-opts.h (struct fma_transformation_info): (class fma_deferring_state): Moved from .cc. (convert_mult_to_fma): Moved from .cc. * tree-ssa-reassoc.cc (enum fma_state): Defined enum to describe the state of FMA candidates for a list of operands. (rewrite_expr_tree_parallel): Changed boolean parameter to enum type. (has_nested_fma_p): New function to check for nested FMA on given multiplication statement. (rank_ops_for_fma): Return enum fma_state. (reassociate_bb): Avoid rewriting to parallel if nested FMAs are found. gcc/testsuite/ChangeLog: * gcc.dg/pr110279-1.c: New test. * gcc.dg/pr110279-2.c: New test. * gcc.dg/pr110279-3.c: New test. > -----Original Message----- > From: Di Zhao OS > Sent: Thursday, August 10, 2023 12:53 AM > To: gcc-patches@gcc.gnu.org > Cc: Richard Biener <richard.guent...@gmail.com> > Subject: [PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc > > Hi, > > The previous version of this patch tries to solve two problems > at the same time. For better clarity, I'll separate them and > only deal with the "nested" FMA in this version. I plan to > propose another patch in avoiding bad shaped FMA (deferring FMA). > > Other changes: > > 1. Added new testcases for the "nested" FMA issue. For the > following code: > > tmp1 = a + c * c + d * d + x * y; > tmp2 = x * tmp1; > result += (a + c + d + tmp2); > > , when "tmp1 = ..." is not rewritten, tmp1 will be result of > an FMA, and there will be a list of consecutive FMAs: > > _1 = .FMA (c, c, a_39); > _2 = .FMA (d, d, _1); > tmp1 = .FMA (x, y, _2); > _3 = .FMA (tmp1, x, d); > ... > > If "tmp1 = ..." is rewritten to parallel, tmp1 will be result > of a PLUS_EXPR between FMAs: > > _1 = .FMA (c, c, a_39); > _2 = x * y; > _3 = .FMA (d, d, _2); > tmp1 = _3 + _1; > _4 = .FMA (tmp1, x, d); > ... > > It seems the register pressure of the latter is higher than > the former. On the test machines we have (including Ampere1, > Neoverse-n1 and Intel Xeon), with "tmp1 = ..." is rewritten to > parallel, the run time all increased significantly. In > contrast, when "tmp1" is not the 1st or 2nd operand of another > FMA (pr110279-1.c), rewriting it results in better performance. > (I'll also append the testcases in the bug tracker.) > > 2. Enhanced checking for nested FMA by: 1) Modified > convert_mult_to_fma so it can return multiple LHS. 2) Check > NEGATE_EXPRs for nested FMA. > > (I think maybe this can be further refined by enabling rewriting > to parallel for very long op list. ) > > Bootstrapped and regression tested on x86_64-linux-gnu. > > Thanks, > Di Zhao
0001-PATCH-tree-optimization-110279-Check-for-nested-FMA-.patch
Description: 0001-PATCH-tree-optimization-110279-Check-for-nested-FMA-.patch