Hi,

A few updates to the patch:

1. rank_ops_for_fma: return FMA_STATE_NESTED only for a complete
   FMA chain, since the regression is obvious only in this case
   (see the sketch below).

2. Added new testcase.

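For reference, the intent of the new enum is roughly as follows (a
sketch only; apart from FMA_STATE_NESTED the names below are
placeholders, see the attached patch for the exact definitions):

    /* Sketch: result of scanning an operand list for FMA candidates.  */
    enum fma_state
    {
      FMA_STATE_NONE,    /* No FMA candidate among the operands.  */
      FMA_STATE_SIMPLE,  /* FMA candidates found; rewriting the expression
                            tree in parallel is still acceptable.  */
      FMA_STATE_NESTED   /* The operands form a complete FMA chain whose
                            result feeds another FMA; avoid rewriting
                            to parallel.  */
    };
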
Thanks,
Di Zhao

----

        PR tree-optimization/110279

gcc/ChangeLog:

        * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added
        new parameter collect_lhs.
        (struct fma_transformation_info): Moved to header.
        (class fma_deferring_state): Moved to header.
        (convert_mult_to_fma): Added new parameter collect_lhs.
        * tree-ssa-math-opts.h (struct fma_transformation_info): Moved
        from .cc.
        (class fma_deferring_state): Moved from .cc.
        (convert_mult_to_fma): Moved from .cc.
        * tree-ssa-reassoc.cc (enum fma_state): Defined enum to
        describe the state of FMA candidates for a list of
        operands.
        (rewrite_expr_tree_parallel): Changed boolean parameter
        to enum type.
        (has_nested_fma_p): New function to check for nested FMA
        on given multiplication statement.
        (rank_ops_for_fma): Return enum fma_state.
        (reassociate_bb): Avoid rewriting to parallel if nested
        FMAs are found.

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279-1.c: New test.
        * gcc.dg/pr110279-2.c: New test.
        * gcc.dg/pr110279-3.c: New test.
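
To clarify the intent of has_nested_fma_p, below is a rough sketch of
the kind of immediate-use walk involved. This is illustrative only: the
helper name, signature and details are simplified guesses (the real
check presumably works on the LHS list collected through the new
collect_lhs parameter of convert_mult_to_fma, and the attached patch is
authoritative); it assumes the usual GIMPLE/SSA headers.

    /* Sketch only: return true if LHS (the result of a would-be FMA)
       is used as the first or second operand of another multiplication
       that could itself be fused into an FMA, looking through
       negations (for FNMA/FNMS).  */
    static bool
    feeds_another_fma_candidate_p (tree lhs)
    {
      use_operand_p use_p;
      imm_use_iterator iter;

      FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
        {
          gimple *use_stmt = USE_STMT (use_p);
          if (!is_gimple_assign (use_stmt))
            continue;
          /* A negated result can still end up as an FMA operand.  */
          if (gimple_assign_rhs_code (use_stmt) == NEGATE_EXPR)
            {
              if (feeds_another_fma_candidate_p (gimple_assign_lhs (use_stmt)))
                return true;
              continue;
            }
          /* LHS feeding a multiplication makes it the 1st/2nd operand
             of the FMA that multiplication may become.  */
          if (gimple_assign_rhs_code (use_stmt) == MULT_EXPR
              && (gimple_assign_rhs1 (use_stmt) == lhs
                  || gimple_assign_rhs2 (use_stmt) == lhs))
            return true;
        }
      return false;
    }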

> -----Original Message-----
> From: Di Zhao OS
> Sent: Thursday, August 10, 2023 12:53 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Biener <richard.guent...@gmail.com>
> Subject: [PATCH v3] tree-optimization/110279 - Check for nested FMA in reassoc
> 
> Hi,
> 
> The previous version of this patch tried to solve two problems
> at the same time. For better clarity, I'll separate them and
> only deal with the "nested" FMA issue in this version. I plan to
> propose another patch for avoiding badly shaped FMAs (deferring FMA).
> 
> Other changes:
> 
> 1. Added new testcases for the "nested" FMA issue. For the
>    following code:
> 
>       tmp1 = a + c * c + d * d + x * y;
>       tmp2 = x * tmp1;
>       result += (a + c + d + tmp2);
> 
>    When "tmp1 = ..." is not rewritten, tmp1 will be the result of
>    an FMA, and there will be a chain of consecutive FMAs:
> 
>       _1 = .FMA (c, c, a_39);
>       _2 = .FMA (d, d, _1);
>       tmp1 = .FMA (x, y, _2);
>       _3 = .FMA (tmp1, x, d);
>       ...
> 
>    If "tmp1 = ..." is rewritten to parallel, tmp1 will be the
>    result of a PLUS_EXPR between FMAs:
> 
>       _1 = .FMA (c, c, a_39);
>       _2 = x * y;
>       _3 = .FMA (d, d, _2);
>       tmp1 = _3 + _1;
>       _4 = .FMA (tmp1, x, d);
>       ...
> 
>    It seems the register pressure of the latter is higher than
>    that of the former. On the test machines we have (including
>    Ampere1, Neoverse N1 and Intel Xeon), the run time increased
>    significantly on all of them when "tmp1 = ..." is rewritten to
>    parallel. In contrast, when "tmp1" is not the 1st or 2nd operand
>    of another FMA (pr110279-1.c), rewriting it results in better
>    performance. (I'll also attach the testcases to the bug tracker.)
> 
> 2. Enhanced the checking for nested FMA by: 1) modifying
>    convert_mult_to_fma so it can return multiple LHS; 2) also
>    checking NEGATE_EXPRs when looking for nested FMA.
> 
> (I think this can perhaps be further refined by enabling rewriting
> to parallel for very long op lists.)
> 
> Bootstrapped and regression tested on x86_64-linux-gnu.
> 
> Thanks,
> Di Zhao

Attachment: 0001-PATCH-tree-optimization-110279-Check-for-nested-FMA-.patch
