Sorry, I missed the recent updates on trunk regarding FMA handling.
I'll measure again to see whether anything in this patch still helps.

Thanks,
Di Zhao

> -----Original Message-----
> From: Di Zhao OS
> Sent: Friday, May 26, 2023 3:15 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [RFC][PATCH] Improve generating FMA by adding a widening_mul pass
> 
> GCC's reassociation pass has no knowledge of FMA, so when it rewrites
> expression lists into parallel form, it reduces the opportunities to
> generate FMAs. Currently there's a workaround
> on AArch64 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114),
> that is, to disable the parallelization with floating-point additions.
> However, this approach may cause regressions. For example, in the
> code below there are only floating-point additions when calculating
> "result += array[j]", and rewriting to parallel is better:
> 
> // Compile with -Ofast on aarch64
> float foo (int n, float in)
> {
>   float array[8] = { 0.1, 1.0, 1.1, 100.0, 10.5, 0.5, 0.01, 9.9 };
>   float result = 0.0;
>   for (int i = 0; i < n; i++)
>     {
>       if (i % 10)
>         for (unsigned j = 0; j < 8; j++)
>           array[j] *= in;
> 
>       for (unsigned j = 0; j < 8; j++)
>         result += array[j];
>     }
>   return result;
> }
> 
> To improve this, one option is to count the number of MUL_EXPRs in an
> operator list before rewriting to parallel, and allow the rewriting
> when there are none (or only one). This is simple and unlikely to
> introduce regressions. However, it lacks flexibility and cannot handle
> more general cases.
> 
> Here's an attempt to address the issue more generally.
> 
> 1. Added an additional widening_mul pass before the original reassoc2
> pass. The new pass only inserts FMAs and leaves other operations, such
> as convert_mult_to_widen, to the existing late widening_mul pass, so
> that optimizations between the two passes are not hindered.
> 
> 2. On some platforms, for a very long FMA chain, rewriting to parallel
> can be faster. Extended the original "deferring" logic so that all
> conversions to FMA can be deferred. Introduced a new parameter
> op-count-prefer-reassoc to control this behavior.
> 
> 3. Additionally, the new widening_mul pass calls execute_reassoc first,
> to avoid losing opportunities such as folding constants and
> undistributing.
> 
> However, changing the sequence of generating FMA and reassociation may
> expose more FMA chains that are slow (see commit 4a0d0ed2).
> To reduce possible regressions, this patch improves the handling of
> slow FMA chains by:
> 
> 1. Modified result_of_phi to support checking an additional FADD/FMUL.
> 
> 2. On some CPUs, rather than removing the whole FMA chain, only skipping
> a few candidates may generate faster code. Added new parameter
> fskip-fma-heuristic to control this behavior.
> 
> This patch also solves https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350.
> 
> Thanks,
> Di Zhao
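
To make the trade-off described in the quoted message concrete, here is a
minimal sketch, not taken from the patch, of how association order changes
the number of FMA candidates in a sum of four products. The function names
dot4_serial and dot4_parallel are made up for illustration, and whether a
compiler actually fuses each marked line depends on the target and on
options such as -ffp-contract.

```c
/* Serial association: ((x0*y0 + x1*y1) + x2*y2) + x3*y3.
   After the first product, every addition has a multiply as one
   operand, so each step is an FMA candidate:
   1 multiply + 3 FMA candidates, dependency chain of 4 operations.  */
float dot4_serial (const float *x, const float *y)
{
  float acc = x[0] * y[0];      /* multiply */
  acc = x[1] * y[1] + acc;      /* FMA candidate */
  acc = x[2] * y[2] + acc;      /* FMA candidate */
  acc = x[3] * y[3] + acc;      /* FMA candidate */
  return acc;
}

/* Parallel association: (x0*y0 + x1*y1) + (x2*y2 + x3*y3).
   The final addition joins two partial sums, so it has no multiply
   operand left to fuse:
   2 multiplies + 2 FMA candidates + 1 plain add, but the two halves
   are independent and the dependency chain is only 3 operations.  */
float dot4_parallel (const float *x, const float *y)
{
  float lo = x[0] * y[0];       /* multiply */
  lo = x[1] * y[1] + lo;        /* FMA candidate */
  float hi = x[2] * y[2];       /* multiply, independent of lo */
  hi = x[3] * y[3] + hi;        /* FMA candidate */
  return lo + hi;               /* plain add, nothing to fuse */
}
```

Both functions compute the same dot product; the parallel form trades one
FMA for a shorter dependency chain, which is the tension between
reassociation and FMA generation discussed above.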
