On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS <diz...@os.amperecomputing.com> wrote:
>
> Hi,
>
> > -----Original Message-----
> > From: Richard Biener <richard.guent...@gmail.com>
> > Sent: Tuesday, August 29, 2023 3:41 PM
> > To: Jeff Law <jeffreya...@gmail.com>; Martin Jambor <mjam...@suse.cz>
> > Cc: Di Zhao OS <diz...@os.amperecomputing.com>; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> > reduce cross backedge FMA
> >
> > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > ampere1 in the tracker. (Previous discussion is here:
> > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > >
> > > > 1. Add testcases for the problem. For an op list in the form of
> > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > swap the operands so that more FMAs can be generated.
> > > > After widening_mul the result looks like:
> > > >
> > > >    _1 = .FMA (a, b, acc_0);
> > > >    acc_1 = .FMA (c, d, _1);
> > > >
> > > > While previously (before the "Handle FMA friendly..." patch),
> > > > widening_mul's result was like:
> > > >
> > > >    _1 = a * b;
> > > >    _2 = .FMA (c, d, _1);
> > > >    acc_1 = acc_0 + _2;
> >
> > How can we execute the multiply and the FMA in parallel?  They
> > depend on each other.  Or is it that the uarch can handle dependence
> > on the add operand in some better way, but only when it is with a
> > multiplication and not an FMA?  (I'd doubt so much complexity)
> >
> > Can you explain in more detail how the uarch executes one vs. the
> > other case?
> >
> > > > If the code fragment is in a loop, some architectures can execute
> > > > the latter in parallel, so the performance can be much faster than
> > > > the former.
> > > > For the small testcase, the performance gap is over
> > > > 10% on both ampere1 and neoverse-n1. So the point here is to avoid
> > > > turning the last statement into an FMA, and keep it a PLUS_EXPR as
> > > > much as possible. (If we are rewriting the op list into parallel
> > > > form, no special treatment is needed, since the last statement
> > > > after rewrite_expr_tree_parallel will be a PLUS_EXPR anyway.)
> > > >
> > > > 2. Function result_feeds_back_from_phi_p is to check for cross
> > > > backedge dependency. Added new enum fma_state to describe the
> > > > state of FMA candidates.
> > > >
> > > > With this patch, there's a 3% improvement in the 510.parest_r
> > > > 1-copy run on ampere1. The compile options are:
> > > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > > >
> > > > Best regards,
> > > > Di Zhao
> > > >
> > > > ----
> > > >
> > > > PR tree-optimization/110279
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * tree-ssa-reassoc.cc (enum fma_state): New enum to
> > > >         describe the state of FMA candidates for an op list.
> > > >         (rewrite_expr_tree_parallel): Changed boolean
> > > >         parameter to enum type.
> > > >         (result_feeds_back_from_phi_p): New function to check
> > > >         for cross backedge dependency.
> > > >         (rank_ops_for_fma): Return enum fma_state. Added new
> > > >         parameter.
> > > >         (reassociate_bb): If there's backedge dependency in an
> > > >         op list, swap the operands before rewrite_expr_tree.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         * gcc.dg/pr110279.c: New test.
> > >
> > > Not a review, but more of a question -- isn't this transformation's
> > > profitability uarch sensitive?  I.e., just because it's bad for a
> > > set of aarch64 uarches doesn't mean it's bad everywhere.
> > >
> > > And in general we shy away from trying to adjust gimple code based
> > > on uarch preferences.
> > >
> > > It seems the right place to do this is gimple->rtl expansion.
>
> Another comment is that FMA forming has this deferring code which I
> think deals exactly with this kind of thing?  CCing Martin who did this
> work based on AMD uarchs also not wanting cross-loop dependences
> on FMAs (or so).  In particular I see
>
>   if (fma_state.m_deferring_p
>       && fma_state.m_initial_phi)
>     {
>       gcc_checking_assert (fma_state.m_last_result);
>       if (!last_fma_candidate_feeds_initial_phi (&fma_state,
>                                                  &m_last_result_set))
>         cancel_fma_deferring (&fma_state);
>
> and I think code to avoid FMAs in other/related cases should be here
> as well, like avoiding forming back-to-back FMAs.

The changes in this patch are controlled by "param_avoid_fma_max_bits", so
I think it should only affect architectures with similar behavior. (The
parameter was added in a previous patch, "Deferring FMA transformations
in tight loops", which seems to be dealing with the same issue.)
That's what I said.  Is the pipeline behavior properly modeled by the
scheduler description?  Your patch seems to not only affect loops, but
the FMA forming case already present does - is the case you are after
not in a tight loop?  In principle, with a loop with enough scheduling
freedom this should be a non-issue - unless the pipeline model isn't
accurate and we put the FMAs too close together?

Richard.

> > Richard.
>
> > > Jeff
>
> Thanks,
> Di Zhao
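[For readers following the thread, the op list under discussion reduces to the loop below. This is an illustrative sketch, not the actual gcc.dg/pr110279.c testcase from the patch; the function name and signature are invented for the example.]

```c
/* Illustrative reduction of the PR110279 pattern: an accumulator that
   crosses the loop backedge through an FMA-friendly op list.  If both
   additions become FMAs (.FMA feeding .FMA into acc), each iteration's
   critical path through acc is two FMA latencies.  If the final
   statement stays a PLUS_EXPR, the multiply and the first FMA do not
   depend on acc, so only one add latency remains on the backedge.  */
double
loop_fma_chain (const double *a, const double *b,
                const double *c, const double *d, int n)
{
  double acc = 0.0;
  for (int i = 0; i < n; i++)
    acc = a[i] * b[i] + c[i] * d[i] + acc;
  return acc;
}
```

With options along the lines of the thread's "-Ofast -mcpu=ampere1 --param avoid-fma-max-bits=512", the question debated above is whether that final add is kept as a PLUS_EXPR or fused into a second .FMA.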