Re: [PATCH] match.pd: Add std::pow folding optimizations.

Richard Biener Fri, 25 Oct 2024 05:39:42 -0700

On Wed, 23 Oct 2024, Jennifer Schmitz wrote:

> 
> 
> > On 22 Oct 2024, at 13:14, Richard Biener <rguent...@suse.de> wrote:
> > 
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Tue, 22 Oct 2024, Jennifer Schmitz wrote:
> > 
> >> 
> >> 
> >>> On 22 Oct 2024, at 11:05, Richard Biener <rguent...@suse.de> wrote:
> >>> 
> >>> External email: Use caution opening links or attachments
> >>> 
> >>> 
> >>> On Tue, 22 Oct 2024, Jennifer Schmitz wrote:
> >>> 
> >>>> 
> >>>> 
> >>>>> On 21 Oct 2024, at 10:51, Richard Biener <rguent...@suse.de> wrote:
> >>>>> 
> >>>>> External email: Use caution opening links or attachments
> >>>>> 
> >>>>> 
> >>>>> On Fri, 18 Oct 2024, Jennifer Schmitz wrote:
> >>>>> 
> >>>>>> This patch adds the following two simplifications in match.pd:
> >>>>>> - pow (1.0/x, y) to pow (x, -y), avoiding the division
> >>>>>> - pow (0.0, x) to 0.0, avoiding the call to pow.
> >>>>>> The patterns are guarded by flag_unsafe_math_optimizations,
> >>>>>> !flag_trapping_math, !flag_errno_math, !HONOR_SIGNED_ZEROS,
> >>>>>> and !HONOR_INFINITIES.
> >>>>>> 
> >>>>>> Tests were added to confirm the application of the transform for float,
> >>>>>> double, and long double.
> >>>>>> 
> >>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu and
> >>>>>> x86_64-linux-gnu, no regression.
> >>>>>> OK for mainline?
> >>>>>> 
> >>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
> >>>>>> 
> >>>>>> gcc/
> >>>>>>    * match.pd: Fold pow (1.0/x, y) -> pow (x, -y) and
> >>>>>>    pow (0.0, x) -> 0.0.
> >>>>>> 
> >>>>>> gcc/testsuite/
> >>>>>>    * gcc.dg/tree-ssa/pow_fold_1.c: New test.
> >>>>>> ---
> >>>>>> gcc/match.pd                               | 14 +++++++++
> >>>>>> gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c | 34 ++++++++++++++++++++++
> >>>>>> 2 files changed, 48 insertions(+)
> >>>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c
> >>>>>> 
> >>>>>> diff --git a/gcc/match.pd b/gcc/match.pd
> >>>>>> index 12d81fcac0d..ba100b117e7 100644
> >>>>>> --- a/gcc/match.pd
> >>>>>> +++ b/gcc/match.pd
> >>>>>> @@ -8203,6 +8203,20 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >>>>>>  (rdiv @0 (exps:s @1))
> >>>>>>   (mult @0 (exps (negate @1)))))
> >>>>>> 
> >>>>>> + /* Simplify pow(1.0/x, y) into pow(x, -y).  */
> >>>>>> + (if (! HONOR_INFINITIES (type)
> >>>>>> +      && ! HONOR_SIGNED_ZEROS (type)
> >>>>>> +      && ! flag_trapping_math
> >>>>>> +      && ! flag_errno_math)
> >>>>>> +  (simplify
> >>>>>> +   (POW (rdiv:s real_onep@0 @1) @2)
> >>>>>> +    (POW @1 (negate @2)))
> >>>>> 
> >>>>> This one shouldn't need HONOR_SIGNED_ZEROS?
> >>>>> 
> >>>>>> +
> >>>>>> +  /* Simplify pow(0.0, x) into 0.0.  */
> >>>>>> +  (simplify
> >>>>>> +   (POW real_zerop@0 @1)
> >>>>> 
> >>>>> I think this needs !HONOR_NANS (type)?
> >>>>> 
> >>>>> Otherwise OK.
> >>>> Thanks for the feedback, Richard and Andrew. I made the following 
> >>>> changes to the patch (current version of the patch below):
> >>>> - also applied the pattern to POWI and added tests for pow, powif, powil
> >>>> - not gate first pattern under !HONOR_SIGNED_ZEROS, but second one 
> >>>> additionally under !HONOR_NANS (type)
> >>>> - added tests for powf16
> >>> 
> >>> Note powi is GCC internal, it doesn't set errno and it should be subject
> >>> to different rules - I'd rather have patterns working on powi separate.
> >> How about moving the patterns for POWI into the section 
> >> flag_unsafe_math_optimizations && canonicalize_math_p () and not use 
> >> (!flag_errno_math)?
> > 
> > Sounds good.
> > 
> >>> 
> >>>> Now, I am encountering two problems:
> >>>> 
> >>>> First, the transform is not applied for float16 (even if 
> >>>> -fexcess-precision=16). Do you know what the problem could be?
> >>> 
> >>> I think you want to use POW_ALL instead of POW.  The generated
> >>> cfn-operators.pd shows
> >>> 
> >>> (define_operator_list POW
> >>>   BUILT_IN_POWF
> >>>   BUILT_IN_POW
> >>>   BUILT_IN_POWL
> >>>   IFN_POW)
> >>> (define_operator_list POW_FN
> >>>   BUILT_IN_POWF16
> >>>   BUILT_IN_POWF32
> >>>   BUILT_IN_POWF64
> >>>   BUILT_IN_POWF128
> >>>   BUILT_IN_POWF32X
> >>>   BUILT_IN_POWF64X
> >>>   BUILT_IN_POWF128X
> >>>   null)
> >>> (define_operator_list POW_ALL
> >>>   BUILT_IN_POWF
> >>>   BUILT_IN_POW
> >>>   BUILT_IN_POWL
> >>>   BUILT_IN_POWF16
> >>> ...
> >>> 
> >>> note this comes at expense of more generated code (in
> >>> gimple/generic-match.pd).
> >> Thanks, that solved the Float16 issue.
> >>> 
> >>>> Second, validation on aarch64 shows a regression in tests
> >>>> - gcc.dg/recip_sqrt_mult_1.c and
> >>>> - gcc.dg/recip_sqrt_mult_5.c,
> >>>> because the pattern (POWI(1/x, y) -> POWI(x, -y)) is applied before the 
> >>>> recip pass and prevents application of the recip-patterns. The reason 
> >>>> for this might be that the single-use restriction only work if the 
> >>>> integer argument is non-constant, but in the failing test cases, the 
> >>>> integer argument is 2 and the pattern is applied despite the :s flag.
> >>>> For example, my pattern is **not** applied (single-use restriction 
> >>>> works) for:
> >>>> double res, res2;
> >>>> void foo (double a, int b)
> >>>> {
> >>>> double f (double);
> >>>> double t1 = 1.0 / a;
> >>>> res = __builtin_powi (t1, b);
> >>>> res2 = f (t1);
> >>>> }
> >>>> 
> >>>> But the pattern **is** applied and single-use restriction does **not** 
> >>>> work for:
> >>>> double res, res2;
> >>>> void foo (double a)
> >>>> {
> >>>> double f (double);
> >>>> double t1 = 1.0 / a;
> >>>> res = __builtin_powi (t1, 2);
> >>>> res2 = f (t1);
> >>>> }
> >>> 
> >>> This must be because the result is a single operation.  :s only applies
> >>> when the result has sub-expresions.  This is to make CSE work.
> >>> The "fix" is to add explicit && single_use (@n) to override that
> >>> behavior.  Note that I think the transform is good even when the
> >>> division is used because the result reduces the dependence chain length.
> >>> It's only when @2 is non-constant that we're introducing another
> >>> stmt for the negation that re-introduces this latency (even if in
> >>> practice it would be smaller).
> >>> 
> >>>> Possible options to resolve this are:
> >>>> - gate pattern to run after recip pass
> >>>> - do not apply pattern for POWI
> >>> 
> >>> - adjust the testcase (is the final outcome still good?)
> >> Without the patch, there is one fdiv instruction less (below is the 
> >> assembly for recip_sqrt_mult_1.c, but for _5.c it’s analogous):
> >> No patch or with single_use of the result of the division:
> >> foo:
> >>      fmov d30, 1.0e+0
> >>      fsqrt d31, d0
> >>      adrp x0, .LANCHOR0
> >>      add x1, x0, :lo12:.LANCHOR0
> >>      fdiv d30, d30, d0
> >>      fmul d0, d31, d30
> >>      str d0, [x0, #:lo12:.LANCHOR0]
> >>      stp d30, d31, [x1, 8]
> >>      ret
> >> 
> >> With patch:
> >> foo:
> >>        fsqrt   d31, d0
> >>        fmov    d30, 1.0e+0
> >>        adrp    x1, .LANCHOR0
> >>        add     x0, x1, :lo12:.LANCHOR0
> >>        fdiv    d31, d30, d31
> >>        fdiv    d30, d30, d0
> >>        str     d31, [x1, #:lo12:.LANCHOR0]
> >>        fmul    d31, d31, d0
> >>        stp     d30, d31, [x0, 8]
> >>        ret
> >> So, we might want to use the single_use guard.
> > 
> > Yeah, this is because the powi inline expansion will add back
> > the division.
> Below is the updated patch, I re-validated with no regression on aarch64 and 
> x86_64.
> Thanks,
> Jenni
> 
> This patch adds the following two simplifications in match.pd for
> POW_ALL and POWI:
> - pow (1.0/x, y) to pow (x, -y), avoiding the division
> - pow (0.0, x) to 0.0, avoiding the call to pow.
> The patterns are guarded by flag_unsafe_math_optimizations,
> !flag_trapping_math, and !HONOR_INFINITIES.
> The POW_ALL patterns are also gated under !flag_errno_math.
> The second pattern is also guarded by !HONOR_NANS and
> !HONOR_SIGNED_ZEROS.
> 
> Tests were added to confirm the application of the transform for
> builtins pow, powf, powl, powi, powif, powil, and powf16.
> 
> The patch was bootstrapped and regtested on aarch64-linux-gnu and
> x86_64-linux-gnu, no regression.
> OK for mainline?


OK.

Thanks,
Richard.

> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
> 
> gcc/
>       * match.pd: Fold pow (1.0/x, y) -> pow (x, -y) and
>       pow (0.0, x) -> 0.0.
> 
> gcc/testsuite/
>       * gcc.dg/tree-ssa/pow_fold_1.c: New test.
> ---
>  gcc/match.pd                               | 28 +++++++++++++++
>  gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c | 42 ++++++++++++++++++++++
>  2 files changed, 70 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c
> 
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 12d81fcac0d..6d9868d2bb1 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -8203,6 +8203,21 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>     (rdiv @0 (exps:s @1))
>      (mult @0 (exps (negate @1)))))
>  
> + (for pow (POW_ALL)
> +  (if (! HONOR_INFINITIES (type)
> +       && ! flag_trapping_math
> +       && ! flag_errno_math)
> +   /* Simplify pow(1.0/x, y) into pow(x, -y).  */
> +   (simplify
> +    (pow (rdiv:s real_onep@0 @1) @2)
> +     (pow @1 (negate @2)))
> +
> +   /* Simplify pow(0.0, x) into 0.0.  */
> +   (if (! HONOR_NANS (type) && ! HONOR_SIGNED_ZEROS (type))
> +    (simplify
> +     (pow real_zerop@0 @1)
> +      @0))))
> +
>   (if (! HONOR_SIGN_DEPENDENT_ROUNDING (type)
>        && ! HONOR_NANS (type) && ! HONOR_INFINITIES (type)
>        && ! flag_trapping_math
> @@ -8561,6 +8576,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>    (mult (POW:s @0 @1) (POW:s @2 @1))
>     (POW (mult @0 @2) @1))
>  
> + (if (! HONOR_INFINITIES (type) && ! flag_trapping_math)
> +  /* Simplify powi(1.0/x, y) into powi(x, -y).  */
> +  (simplify
> +   (POWI (rdiv@3 real_onep@0 @1) @2)
> +   (if (single_use (@3))
> +    (POWI @1 (negate @2))))
> +
> +  /* Simplify powi(0.0, x) into 0.0.  */
> +  (if (! HONOR_NANS (type) && ! HONOR_SIGNED_ZEROS (type))
> +   (simplify
> +    (POWI real_zerop@0 @1)
> +     @0)))
> +
>   /* Simplify powi(x,y) * powi(z,y) -> powi(x*z,y). */
>   (simplify
>    (mult (POWI:s @0 @1) (POWI:s @2 @1))
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c
> new file mode 100644
> index 00000000000..d98bcb0827e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c
> @@ -0,0 +1,42 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -fdump-tree-optimized -fexcess-precision=16" } */
> +/* { dg-add-options float16 } */
> +/* { dg-require-effective-target float16_runtime } */
> +/* { dg-require-effective-target c99_runtime } */
> +
> +extern void link_error (void);
> +
> +#define POW1OVER(TYPE1, TYPE2, CTY, TY)                      \
> +  void                                                       \
> +  pow1over_##TY (TYPE1 x, TYPE2 y)                   \
> +  {                                                  \
> +    TYPE1 t1 = 1.0##CTY / x;                         \
> +    TYPE1 t2 = __builtin_pow##TY (t1, y);            \
> +    TYPE2 t3 = -y;                                   \
> +    TYPE1 t4 = __builtin_pow##TY (x, t3);            \
> +    if (t2 != t4)                                    \
> +      link_error ();                                 \
> +  }                                                  \
> +
> +#define POW0(TYPE1, TYPE2, CTY, TY)                  \
> +  void                                                       \
> +  pow0_##TY (TYPE2 x)                                        \
> +  {                                                  \
> +    TYPE1 t1 = __builtin_pow##TY (0.0##CTY, x);              \
> +    if (t1 != 0.0##CTY)                                      \
> +      link_error ();                                 \
> +  }                                                  \
> +
> +#define TEST_ALL(TYPE1, TYPE2, CTY, TY)                      \
> +  POW1OVER (TYPE1, TYPE2, CTY, TY)                   \
> +  POW0 (TYPE1, TYPE2, CTY, TY)
> +
> +TEST_ALL (double, double, , )
> +TEST_ALL (float, float, f, f)
> +TEST_ALL (_Float16, _Float16, f16, f16)
> +TEST_ALL (long double, long double, L, l)
> +TEST_ALL (double, int, , i)
> +TEST_ALL (float, int, f, if)
> +TEST_ALL (long double, int, L, il)
> +
> +/* { dg-final { scan-tree-dump-not "link_error" "optimized" } } */
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Re: [PATCH] match.pd: Add std::pow folding optimizations.

Reply via email to