On Wed, Aug 9, 2023 at 6:53 PM Di Zhao OS <diz...@os.amperecomputing.com> wrote:
>
> Hi,
>
> The previous version of this patch tries to solve two problems
> at the same time. For better clarity, I'll separate them and
> only deal with the "nested" FMA in this version. I plan to
> propose another patch in avoiding bad shaped FMA (deferring FMA).
>
> Other changes:
>
> 1. Added new testcases for the "nested" FMA issue. For the
>    following code:
>
>         tmp1 = a + c * c + d * d + x * y;
>         tmp2 = x * tmp1;
>         result += (a + c + d + tmp2);
>
>    , when "tmp1 = ..." is not rewritten, tmp1 will be result of
>    an FMA, and there will be a list of consecutive FMAs:
>
>         _1 = .FMA (c, c, a_39);
>         _2 = .FMA (d, d, _1);
>         tmp1 = .FMA (x, y, _2);
>         _3 = .FMA (tmp1, x, d);
>         ...
>
>    If "tmp1 = ..." is rewritten to parallel, tmp1 will be result
>    of a PLUS_EXPR between FMAs:
>
>         _1 = .FMA (c, c, a_39);
>         _2 = x * y;
>         _3 = .FMA (d, d, _2);
>          tmp1 = _3 + _1;
>          _4 = .FMA (tmp1, x, d);
>         ...
>
>    It seems the register pressure of the latter is higher than
>    the former.

Yes, that's a natural consequence of rewriting to parallel.

>    On the test machines we have (including Ampere1,
>    Neoverse-n1 and Intel Xeon), with "tmp1 = ..." is rewritten to
>    parallel, the run time all increased significantly. In
>    contrast, when "tmp1" is not the 1st or 2nd operand of another
>    FMA (pr110279-1.c), rewriting it results in better performance.
>    (I'll also append the testcases in the bug tracker.)
>
> 2. Enhanced checking for nested FMA by: 1) Modified
>    convert_mult_to_fma so it can return multiple LHS.  2) Check
>    NEGATE_EXPRs for nested FMA.
>
> (I think maybe this can be further refined by enabling rewriting
> to parallel for very long op list. )

So again, what you do applies to all operations, not just FMA.
Consider

      tmp1 = a + c + d + y;
      tmp2 = x + tmp1;
      result += (a + c + d + tmp2);
      foo (tmp2);

where I just removed all multiplications.  Since re-assoc works
on isolated single-use chains it will rewrite the tmp2 chain
to parallel and it will rewrite the result chain to parallel, in
the end this results in reassoc-width for 'result' to not be honored
because we don't see that at 'tmp2' it will fork again.  OTOH
the other 'result' arms end, so eventually just two (for reassoc
width 2) arms are "active" at any point.

That said - isn't the issue that we "overcommit" reassoc-width
this way because we apply it locally instead of globally
(of course also ignoring every other chain of instructions
reassoc isn't interestedin)?

Unfortunately we work backwards when processing chains,
if we processed leaf chains first we could record the
association width applied to the chain at its new root and
honor that when such root ends up in the oplist of a consuming
chain.  But as we work backwards we'd have to record
the reassoc width used to in the leafs of the associated
chain.  So if those become roots of other chains we can
then still honor that.

Would it work to attack the problem this way?  For
parallel rewritten chains record the width used?
Similar to operand_rank we could use a hash-map
from SSA leaf to width it appears in.  When we rewrite
a chain with such leaf as root we can then subtract
the incoming chain width from reassoc-width to lower
the width its tail?

Richard.

> Bootstrapped and regression tested on x86_64-linux-gnu.
>
> Thanks,
> Di Zhao
>
> ----
>
>         PR tree-optimization/110279
>
> gcc/ChangeLog:
>
>         * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added
>         new parameter collect_lhs.
>         (struct fma_transformation_info): Moved to header.
>         (class fma_deferring_state): Moved to header.
>         (convert_mult_to_fma): Added new parameter collect_lhs.
>         * tree-ssa-math-opts.h (struct fma_transformation_info):
>         (class fma_deferring_state): Moved from .cc.
>         (convert_mult_to_fma): Moved from .cc.
>         * tree-ssa-reassoc.cc (enum fma_state): Defined enum to
>         describe the state of FMA candidates for a list of
>         operands.
>         (rewrite_expr_tree_parallel): Changed boolean parameter
>         to enum type.
>         (has_nested_fma_p): New function to check for nested FMA
>         on given multiplication statement.
>         (rank_ops_for_fma): Return enum fma_state.
>         (reassociate_bb): Avoid rewriting to parallel if nested
>         FMAs are found.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/pr110279-1.c: New test.
>         * gcc.dg/pr110279-2.c: New test.

Reply via email to