Hello, and ping.

Thanks,
Di
> -----Original Message-----
> From: Di Zhao OS <diz...@os.amperecomputing.com>
> Sent: Monday, October 9, 2023 12:40 AM
> To: Richard Biener <richard.guent...@gmail.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH v4] [tree-optimization/110279] Consider FMA in
> get_reassociation_width
>
> Attached is a new version of the patch.
>
> > -----Original Message-----
> > From: Richard Biener <richard.guent...@gmail.com>
> > Sent: Friday, October 6, 2023 5:33 PM
> > To: Di Zhao OS <diz...@os.amperecomputing.com>
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > get_reassociation_width
> >
> > On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS
> > <diz...@os.amperecomputing.com> wrote:
> > >
> > > This is a new version of the patch on "nested FMA".
> > > Sorry for updating this after so long; I've been studying and
> > > writing micro cases to sort out the cause of the regression.
> >
> > Sorry for taking so long to reply.
> >
> > > First, following the previous discussion:
> > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html)
> > >
> > > 1. From testing more altered cases, I don't think the
> > > problem is that reassociation works locally. In that:
> > >
> > > 1) On the example with multiplications:
> > >
> > >     tmp1 = a + c * c + d * d + x * y;
> > >     tmp2 = x * tmp1;
> > >     result += (a + c + d + tmp2);
> > >
> > > Given "result" rewritten by width=2, the performance is
> > > worse if we rewrite "tmp1" with width=2. In contrast, if we
> > > remove the multiplications from the example (and make "tmp1"
> > > not single-used), and still rewrite "result" by width=2, then
> > > rewriting "tmp1" with width=2 is better. (This makes sense because
> > > the tree's depth at "result" is still smaller if we rewrite
> > > "tmp1".)
> > >
> > > 2) I tried to modify the assembly code of the example without
> > > FMA, so the width of "result" is 4. On Ampere1 there's no
> > > obvious improvement.
> > > So although this is an interesting
> > > problem, it doesn't seem like the cause of the regression.
> >
> > OK, I see.
> >
> > > 2. From the assembly code of the case with FMA, one problem is
> > > that rewriting "tmp1" to parallel didn't decrease the
> > > minimum CPU cycles (taking MULT_EXPRs into account), but
> > > increased code size, so the overhead is increased.
> > >
> > > a) When "tmp1" is not rewritten to parallel:
> > >     fmadd d31, d2, d2, d30
> > >     fmadd d31, d3, d3, d31
> > >     fmadd d31, d4, d5, d31  // "tmp1"
> > >     fmadd d31, d31, d4, d3
> > >
> > > b) When "tmp1" is rewritten to parallel:
> > >     fmul  d31, d4, d5
> > >     fmadd d27, d2, d2, d30
> > >     fmadd d31, d3, d3, d31
> > >     fadd  d31, d31, d27     // "tmp1"
> > >     fmadd d31, d31, d4, d3
> > >
> > > For version a), there are 3 dependent FMAs to calculate "tmp1".
> > > For version b), there are also 3 dependent instructions on the
> > > longer path: the 1st, 3rd and 4th.
> >
> > Yes, it doesn't really change anything.  The patch has
> >
> > +  /* If there's code like "acc = a * b + c * d + acc" in a tight loop, some
> > +     uarchs can execute results like:
> > +
> > +       _1 = a * b;
> > +       _2 = .FMA (c, d, _1);
> > +       acc_1 = acc_0 + _2;
> > +
> > +     in parallel, while turning it into
> > +
> > +       _1 = .FMA (a, b, acc_0);
> > +       acc_1 = .FMA (c, d, _1);
> > +
> > +     hinders that, because then the first FMA depends on the result of
> > +     the preceding iteration.  */
> >
> > I can't see what can be run in parallel for the first case.  The .FMA
> > depends on the multiplication a * b.  Iff the uarch somehow decomposes
> > .FMA into multiply + add then the c * d multiply could run in parallel
> > with the a * b multiply, which _might_ be able to hide some of the
> > latency of the full .FMA.  Like on x86 Zen, FMA has a latency of 4
> > cycles but a multiply only 3.  But I never got confirmation from any
> > of the CPU designers that .FMAs are issued when the multiply
> > operands are ready and the add operand can be forwarded.
> >
> > I also wonder why the multiplications of the two-FMA sequence
> > then cannot be executed at the same time?  So I have some doubt
> > about the theory above.
>
> The parallel execution for the code snippet above was the other
> issue (previously discussed here:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628960.html).
> Sorry it's a bit confusing to include that here, but these two fixes
> need to be combined to avoid new regressions. Since considering
> FMA in get_reassociation_width produces more results of width=1,
> there would be more loop-dependent FMA chains.
>
> > Iff this really is the reason for the sequence to execute with lower
> > overall latency and we want to attack this on GIMPLE then I think
> > we need a target hook telling us this fact (I also wonder if such
> > behavior can be modeled in the scheduler pipeline description at all?)
> >
> > > So it seems to me the current get_reassociation_width algorithm
> > > isn't optimal in the presence of FMA. So I modified the patch to
> > > improve get_reassociation_width, rather than check for code
> > > patterns. (Although there could be some other complicated
> > > factors, so the regression is more obvious when there's "nested
> > > FMA". But with this patch that should be avoided or reduced.)
> > >
> > > With this patch, a 508.namd_r 1-copy run has 7% improvement on
> > > Ampere1; on Intel Xeon there's about 3%. While I'm still
> > > collecting data on other CPUs, I'd like to know what you
> > > think of this.
> > >
> > > About the changes in the patch:
> > >
> > > 1. When the op list forms a complete FMA chain, try to search
> > > for a smaller width considering the benefit of using FMA. With
> > > a smaller width, the increase in code size is smaller when
> > > breaking the chain.
> >
> > But this is all highly target specific (code size even more so).
> >
> > How I understand your approach to fixing the issue leads me to
> > the suggestion to prioritize parallel rewriting, thus alter
> > rank_ops_for_fma, taking the reassoc width into account (the
> > computed width should be unchanged from rank_ops_for_fma) instead
> > of "fixing up" the parallel rewriting of FMAs (well, they are not
> > yet formed of course).  get_reassociation_width has
> > 'get_required_cycles'; the above theory could be verified with a
> > very simple toy pipeline model.  We'd have to ask the target for
> > the reassoc width for MULT_EXPRs as well (or maybe even FMA_EXPRs).
> >
> > Taking the width of FMAs into account when computing the reassoc
> > width might be another way to attack this.
>
> Previously I tried to solve this generally, on the assumption that
> FMA (smaller code size) is preferred. Now I agree it's difficult,
> since: 1) As you mentioned, the latencies of FMA, FMUL and FADD can
> be different. 2) From my test results on the different machines we
> have, it seems simply adding the cycles together is not a good way
> to estimate the latency of consecutive FMAs.
>
> I think an easier way to fix this is to add a parameter suggesting
> the length of the complete FMA chain to keep. (It can then be set
> by target-specific tuning.) We can break longer FMA chains for
> better parallelism. Attached is the new implementation. With
> max-fma-chain-len=8, there's about 7% improvement in spec2017
> 508.namd_r on ampere1, and the overall improvement on fprate is
> about 1%.
>
> Since there's code in rank_ops_for_fma to identify MULT_EXPRs from
> others, I left it before get_reassociation_width so the number of
> MULT_EXPRs can be used.
>
> > > 2. To avoid regressions, I included the other patch
> > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
> > > on this tracker again.
> > > This is because more FMAs will be kept
> > > with change 1., so we need to rule out the loop-dependent
> > > FMA chains when param_avoid_fma_max_bits is set.
> >
> > Sorry again for taking so long to reply.
> >
> > I'll note we have an odd case on x86 Zen2(?) as well which we don't
> > really understand from a CPU behavior perspective.
> >
> > Thanks,
> > Richard.
> >
> > > Thanks,
> > > Di Zhao
> > >
> > > ----
> > >
> > > PR tree-optimization/110279
> > >
> > > gcc/ChangeLog:
> > >
> > >         * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
> > >         New function to check whether ranking the ops results in
> > >         better parallelism.
> > >         (get_reassociation_width): Add new parameters.  Search for
> > >         smaller width considering the benefit of FMA.
> > >         (rank_ops_for_fma): Change return value to be number of
> > >         MULT_EXPRs.
> > >         (reassociate_bb): For 3 ops, refine the condition to call
> > >         swap_ops_for_binary_stmt.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.dg/pr110279.c: New test.
>
> Thanks,
> Di Zhao
>
> ----
>
> PR tree-optimization/110279
>
> gcc/ChangeLog:
>
>         * doc/invoke.texi: Description of param_max_fma_chain_len.
>         * params.opt: New parameter param_max_fma_chain_len.
>         * tree-ssa-reassoc.cc (get_reassociation_width):
>         Support param_max_fma_chain_len; check for loop dependent
>         FMAs.
>         (rank_ops_for_fma): Return the number of MULT_EXPRs.
>         (reassociate_bb): For 3 ops, refine the condition to call
>         swap_ops_for_binary_stmt.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/pr110279-1.c: New test.
>         * gcc.dg/pr110279-2.c: New test.
>         * gcc.dg/pr110279-3.c: New test.