Hi,

The previous version of this patch tried to solve two problems at the same time. For better clarity, I'll separate them and only deal with the "nested" FMA in this version. I plan to propose another patch to avoid badly shaped FMAs (deferring FMA).

Other changes:

1. Added new testcases for the "nested" FMA issue. For the following code:

    tmp1 = a + c * c + d * d + x * y;
    tmp2 = x * tmp1;
    result += (a + c + d + tmp2);

   when "tmp1 = ..." is not rewritten, tmp1 will be the result of an FMA, and there will be a list of consecutive FMAs:

    _1 = .FMA (c, c, a_39);
    _2 = .FMA (d, d, _1);
    tmp1 = .FMA (x, y, _2);
    _3 = .FMA (tmp1, x, d);
    ...

   If "tmp1 = ..." is rewritten to parallel, tmp1 will be the result of a PLUS_EXPR between FMAs:

    _1 = .FMA (c, c, a_39);
    _2 = x * y;
    _3 = .FMA (d, d, _2);
    tmp1 = _3 + _1;
    _4 = .FMA (tmp1, x, d);
    ...

   It seems the register pressure of the latter is higher than that of the former. On all the test machines we have (including Ampere1, Neoverse-n1 and Intel Xeon), the run time increased significantly when "tmp1 = ..." was rewritten to parallel. In contrast, when "tmp1" is not the 1st or 2nd operand of another FMA (pr110279-1.c), rewriting it results in better performance. (I'll also append the testcases in the bug tracker.)

2. Enhanced checking for nested FMA by:
   1) Modified convert_mult_to_fma so it can return multiple LHS.
   2) Check NEGATE_EXPRs for nested FMA.

(I think maybe this can be further refined by enabling rewriting to parallel for very long op lists.)

Bootstrapped and regression tested on x86_64-linux-gnu.

Thanks,
Di Zhao

----

	PR tree-optimization/110279

gcc/ChangeLog:

	* tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added new parameter
	collect_lhs.
	(struct fma_transformation_info): Moved to header.
	(class fma_deferring_state): Moved to header.
	(convert_mult_to_fma): Added new parameter collect_lhs.
	* tree-ssa-math-opts.h (struct fma_transformation_info):
	(class fma_deferring_state): Moved from .cc.
	(convert_mult_to_fma): Moved from .cc.
	* tree-ssa-reassoc.cc (enum fma_state): Defined enum to describe
	the state of FMA candidates for a list of operands.
	(rewrite_expr_tree_parallel): Changed boolean parameter to enum
	type.
	(has_nested_fma_p): New function to check for nested FMA on given
	multiplication statement.
	(rank_ops_for_fma): Return enum fma_state.
	(reassociate_bb): Avoid rewriting to parallel if nested FMAs are
	found.

gcc/testsuite/ChangeLog:

	* gcc.dg/pr110279-1.c: New test.
	* gcc.dg/pr110279-2.c: New test.
tree-optimization-110279-Check-for-nested-FMA-in-rea.patch