Sorry, I missed the recent updates on trunk regarding FMA handling. I'll measure again to see whether anything in this patch still helps.

Thanks,
Di Zhao

> -----Original Message-----
> From: Di Zhao OS
> Sent: Friday, May 26, 2023 3:15 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [RFC][PATCH] Improve generating FMA by adding a widening_mul pass
>
> As GCC's reassociation pass does not have knowledge of FMA, when
> transforming expression lists to parallel, it reduces the
> opportunities to generate FMAs. Currently there's a workaround
> on AArch64 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114),
> that is, to disable the parallelization with floating-point additions.
> However, this approach may cause regressions. For example, in the
> code below there are only floating-point additions when calculating
> "result += array[j]", and rewriting to parallel is better:
>
> // Compile with -Ofast on aarch64
> float foo (int n, float in)
> {
>   float array[8] = { 0.1, 1.0, 1.1, 100.0, 10.5, 0.5, 0.01, 9.9 };
>   float result = 0.0;
>   for (int i = 0; i < n; i++)
>     {
>       if (i % 10)
>         for (unsigned j = 0; j < 8; j++)
>           array[j] *= in;
>
>       for (unsigned j = 0; j < 8; j++)
>         result += array[j];
>     }
>   return result;
> }
>
> To improve this, one option is to count the number of MUL_EXPRs in an
> operator list before rewriting to parallel, and allow the rewriting
> when there's none (or 1 MUL_EXPR). This is simple and unlikely to
> introduce regressions. However it lacks flexibility and can not handle
> more general cases.
>
> Here's an attempt to address the issue more generally.
>
> 1. Added an additional widening_mul pass before the original reassoc2
> pass. The new pass is limited to only insert FMA, and leave other
> operations like convert_mult_to_widen to the old late widening_mul pass,
> in case other optimizations between the two passes could be hindered.
>
> 2. On some platforms, for a very long FMA chain, rewriting to parallel
> can be faster. Extended the original "deferring" logic so that all
> conversions to FMA can be deferred. Introduced a new parameter
> op-count-prefer-reassoc to control this behavior.
>
> 3. Additionally, the new widening_mul pass calls execute_reassoc first,
> to avoid losing opportunities such as folding constants and
> undistributing.
>
> However, changing the sequence of generating FMA and reassociation may
> expose more FMA chains that are slow (see commit 4a0d0ed2).
> To reduce possible regressions, improved handling the slow FMA chain by:
>
> 1. Modified result_of_phi to support checking an additional FADD/FMUL.
>
> 2. On some CPUs, rather than removing the whole FMA chain, only skipping
> a few candidates may generate faster code. Added new parameter
> fskip-fma-heuristic to control this behavior.
>
> This patch also solves https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350.
>
> Thanks,
> Di Zhao
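
For readers less familiar with the trade-off in point 2 of the quoted message (a long FMA chain versus rewriting to parallel), here is a hand-written C sketch of the two shapes involved. It is only an illustration: it is not code from the patch and not what GCC emits, and the function names dot_one_chain and dot_two_chains are made up for this example. Each multiply-add statement is a candidate for contraction into a single FMA. With one accumulator, every multiply-add has to wait for the previous one. With two accumulators the chains are independent, at the cost of one extra addition at the end and a different rounding order, which is why a compiler only performs this kind of rewrite itself under flags such as -Ofast.

```c
#include <stddef.h>

/* One accumulator: each multiply-add depends on the previous result,
   so the whole loop forms a single serial chain.  */
float dot_one_chain (const float *a, const float *b, size_t n)
{
  float acc = 0.0f;
  for (size_t i = 0; i < n; i++)
    acc = a[i] * b[i] + acc;   /* FMA candidate, serial dependency */
  return acc;
}

/* Two accumulators: the two chains do not depend on each other and
   are only combined once, after the loop.  */
float dot_two_chains (const float *a, const float *b, size_t n)
{
  float acc0 = 0.0f, acc1 = 0.0f;
  size_t i = 0;
  for (; i + 1 < n; i += 2)
    {
      acc0 = a[i] * b[i] + acc0;           /* chain 0 */
      acc1 = a[i + 1] * b[i + 1] + acc1;   /* chain 1 */
    }
  if (i < n)                               /* leftover element for odd n */
    acc0 = a[i] * b[i] + acc0;
  return acc0 + acc1;                      /* the one extra addition */
}
```

When every product and partial sum is exactly representable (small integers, for instance), both functions return the same value. In general the results may differ in the last bits, and which form runs faster depends on the target.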