On Wed, 23 Oct 2024, Jennifer Schmitz wrote: > > > > On 22 Oct 2024, at 13:14, Richard Biener <rguent...@suse.de> wrote: > > > > External email: Use caution opening links or attachments > > > > > > On Tue, 22 Oct 2024, Jennifer Schmitz wrote: > > > >> > >> > >>> On 22 Oct 2024, at 11:05, Richard Biener <rguent...@suse.de> wrote: > >>> > >>> External email: Use caution opening links or attachments > >>> > >>> > >>> On Tue, 22 Oct 2024, Jennifer Schmitz wrote: > >>> > >>>> > >>>> > >>>>> On 21 Oct 2024, at 10:51, Richard Biener <rguent...@suse.de> wrote: > >>>>> > >>>>> External email: Use caution opening links or attachments > >>>>> > >>>>> > >>>>> On Fri, 18 Oct 2024, Jennifer Schmitz wrote: > >>>>> > >>>>>> This patch adds the following two simplifications in match.pd: > >>>>>> - pow (1.0/x, y) to pow (x, -y), avoiding the division > >>>>>> - pow (0.0, x) to 0.0, avoiding the call to pow. > >>>>>> The patterns are guarded by flag_unsafe_math_optimizations, > >>>>>> !flag_trapping_math, !flag_errno_math, !HONOR_SIGNED_ZEROS, > >>>>>> and !HONOR_INFINITIES. > >>>>>> > >>>>>> Tests were added to confirm the application of the transform for float, > >>>>>> double, and long double. > >>>>>> > >>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu and > >>>>>> x86_64-linux-gnu, no regression. > >>>>>> OK for mainline? > >>>>>> > >>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> > >>>>>> > >>>>>> gcc/ > >>>>>> * match.pd: Fold pow (1.0/x, y) -> pow (x, -y) and > >>>>>> pow (0.0, x) -> 0.0. > >>>>>> > >>>>>> gcc/testsuite/ > >>>>>> * gcc.dg/tree-ssa/pow_fold_1.c: New test. > >>>>>> --- > >>>>>> gcc/match.pd | 14 +++++++++ > >>>>>> gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c | 34 ++++++++++++++++++++++ > >>>>>> 2 files changed, 48 insertions(+) > >>>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c > >>>>>> > >>>>>> diff --git a/gcc/match.pd b/gcc/match.pd > >>>>>> index 12d81fcac0d..ba100b117e7 100644 > >>>>>> --- a/gcc/match.pd > >>>>>> +++ b/gcc/match.pd > >>>>>> @@ -8203,6 +8203,20 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > >>>>>> (rdiv @0 (exps:s @1)) > >>>>>> (mult @0 (exps (negate @1))))) > >>>>>> > >>>>>> + /* Simplify pow(1.0/x, y) into pow(x, -y). */ > >>>>>> + (if (! HONOR_INFINITIES (type) > >>>>>> + && ! HONOR_SIGNED_ZEROS (type) > >>>>>> + && ! flag_trapping_math > >>>>>> + && ! flag_errno_math) > >>>>>> + (simplify > >>>>>> + (POW (rdiv:s real_onep@0 @1) @2) > >>>>>> + (POW @1 (negate @2))) > >>>>> > >>>>> This one shouldn't need HONOR_SIGNED_ZEROS? > >>>>> > >>>>>> + > >>>>>> + /* Simplify pow(0.0, x) into 0.0. */ > >>>>>> + (simplify > >>>>>> + (POW real_zerop@0 @1) > >>>>> > >>>>> I think this needs !HONOR_NANS (type)? > >>>>> > >>>>> Otherwise OK. > >>>> Thanks for the feedback, Richard and Andrew. I made the following > >>>> changes to the patch (current version of the patch below): > >>>> - also applied the pattern to POWI and added tests for pow, powif, powil > >>>> - not gate first pattern under !HONOR_SIGNED_ZEROS, but second one > >>>> additionally under !HONOR_NANS (type) > >>>> - added tests for powf16 > >>> > >>> Note powi is GCC internal, it doesn't set errno and it should be subject > >>> to different rules - I'd rather have patterns working on powi separate. > >> How about moving the patterns for POWI into the section > >> flag_unsafe_math_optimizations && canonicalize_math_p () and not use > >> (!flag_errno_math)? > > > > Sounds good. > > > >>> > >>>> Now, I am encountering two problems: > >>>> > >>>> First, the transform is not applied for float16 (even if > >>>> -fexcess-precision=16). Do you know what the problem could be? > >>> > >>> I think you want to use POW_ALL instead of POW. The generated > >>> cfn-operators.pd shows > >>> > >>> (define_operator_list POW > >>> BUILT_IN_POWF > >>> BUILT_IN_POW > >>> BUILT_IN_POWL > >>> IFN_POW) > >>> (define_operator_list POW_FN > >>> BUILT_IN_POWF16 > >>> BUILT_IN_POWF32 > >>> BUILT_IN_POWF64 > >>> BUILT_IN_POWF128 > >>> BUILT_IN_POWF32X > >>> BUILT_IN_POWF64X > >>> BUILT_IN_POWF128X > >>> null) > >>> (define_operator_list POW_ALL > >>> BUILT_IN_POWF > >>> BUILT_IN_POW > >>> BUILT_IN_POWL > >>> BUILT_IN_POWF16 > >>> ... > >>> > >>> note this comes at expense of more generated code (in > >>> gimple/generic-match.pd). > >> Thanks, that solved the Float16 issue. > >>> > >>>> Second, validation on aarch64 shows a regression in tests > >>>> - gcc.dg/recip_sqrt_mult_1.c and > >>>> - gcc.dg/recip_sqrt_mult_5.c, > >>>> because the pattern (POWI(1/x, y) -> POWI(x, -y)) is applied before the > >>>> recip pass and prevents application of the recip-patterns. The reason > >>>> for this might be that the single-use restriction only work if the > >>>> integer argument is non-constant, but in the failing test cases, the > >>>> integer argument is 2 and the pattern is applied despite the :s flag. > >>>> For example, my pattern is **not** applied (single-use restriction > >>>> works) for: > >>>> double res, res2; > >>>> void foo (double a, int b) > >>>> { > >>>> double f (double); > >>>> double t1 = 1.0 / a; > >>>> res = __builtin_powi (t1, b); > >>>> res2 = f (t1); > >>>> } > >>>> > >>>> But the pattern **is** applied and single-use restriction does **not** > >>>> work for: > >>>> double res, res2; > >>>> void foo (double a) > >>>> { > >>>> double f (double); > >>>> double t1 = 1.0 / a; > >>>> res = __builtin_powi (t1, 2); > >>>> res2 = f (t1); > >>>> } > >>> > >>> This must be because the result is a single operation. :s only applies > >>> when the result has sub-expresions. This is to make CSE work. > >>> The "fix" is to add explicit && single_use (@n) to override that > >>> behavior. Note that I think the transform is good even when the > >>> division is used because the result reduces the dependence chain length. > >>> It's only when @2 is non-constant that we're introducing another > >>> stmt for the negation that re-introduces this latency (even if in > >>> practice it would be smaller). > >>> > >>>> Possible options to resolve this are: > >>>> - gate pattern to run after recip pass > >>>> - do not apply pattern for POWI > >>> > >>> - adjust the testcase (is the final outcome still good?) > >> Without the patch, there is one fdiv instruction less (below is the > >> assembly for recip_sqrt_mult_1.c, but for _5.c it’s analogous): > >> No patch or with single_use of the result of the division: > >> foo: > >> fmov d30, 1.0e+0 > >> fsqrt d31, d0 > >> adrp x0, .LANCHOR0 > >> add x1, x0, :lo12:.LANCHOR0 > >> fdiv d30, d30, d0 > >> fmul d0, d31, d30 > >> str d0, [x0, #:lo12:.LANCHOR0] > >> stp d30, d31, [x1, 8] > >> ret > >> > >> With patch: > >> foo: > >> fsqrt d31, d0 > >> fmov d30, 1.0e+0 > >> adrp x1, .LANCHOR0 > >> add x0, x1, :lo12:.LANCHOR0 > >> fdiv d31, d30, d31 > >> fdiv d30, d30, d0 > >> str d31, [x1, #:lo12:.LANCHOR0] > >> fmul d31, d31, d0 > >> stp d30, d31, [x0, 8] > >> ret > >> So, we might want to use the single_use guard. > > > > Yeah, this is because the powi inline expansion will add back > > the division. > Below is the updated patch, I re-validated with no regression on aarch64 and > x86_64. > Thanks, > Jenni > > This patch adds the following two simplifications in match.pd for > POW_ALL and POWI: > - pow (1.0/x, y) to pow (x, -y), avoiding the division > - pow (0.0, x) to 0.0, avoiding the call to pow. > The patterns are guarded by flag_unsafe_math_optimizations, > !flag_trapping_math, and !HONOR_INFINITIES. > The POW_ALL patterns are also gated under !flag_errno_math. > The second pattern is also guarded by !HONOR_NANS and > !HONOR_SIGNED_ZEROS. > > Tests were added to confirm the application of the transform for > builtins pow, powf, powl, powi, powif, powil, and powf16. > > The patch was bootstrapped and regtested on aarch64-linux-gnu and > x86_64-linux-gnu, no regression. > OK for mainline?
OK. Thanks, Richard. > Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> > > gcc/ > * match.pd: Fold pow (1.0/x, y) -> pow (x, -y) and > pow (0.0, x) -> 0.0. > > gcc/testsuite/ > * gcc.dg/tree-ssa/pow_fold_1.c: New test. > --- > gcc/match.pd | 28 +++++++++++++++ > gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c | 42 ++++++++++++++++++++++ > 2 files changed, 70 insertions(+) > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c > > diff --git a/gcc/match.pd b/gcc/match.pd > index 12d81fcac0d..6d9868d2bb1 100644 > --- a/gcc/match.pd > +++ b/gcc/match.pd > @@ -8203,6 +8203,21 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > (rdiv @0 (exps:s @1)) > (mult @0 (exps (negate @1))))) > > + (for pow (POW_ALL) > + (if (! HONOR_INFINITIES (type) > + && ! flag_trapping_math > + && ! flag_errno_math) > + /* Simplify pow(1.0/x, y) into pow(x, -y). */ > + (simplify > + (pow (rdiv:s real_onep@0 @1) @2) > + (pow @1 (negate @2))) > + > + /* Simplify pow(0.0, x) into 0.0. */ > + (if (! HONOR_NANS (type) && ! HONOR_SIGNED_ZEROS (type)) > + (simplify > + (pow real_zerop@0 @1) > + @0)))) > + > (if (! HONOR_SIGN_DEPENDENT_ROUNDING (type) > && ! HONOR_NANS (type) && ! HONOR_INFINITIES (type) > && ! flag_trapping_math > @@ -8561,6 +8576,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > (mult (POW:s @0 @1) (POW:s @2 @1)) > (POW (mult @0 @2) @1)) > > + (if (! HONOR_INFINITIES (type) && ! flag_trapping_math) > + /* Simplify powi(1.0/x, y) into powi(x, -y). */ > + (simplify > + (POWI (rdiv@3 real_onep@0 @1) @2) > + (if (single_use (@3)) > + (POWI @1 (negate @2)))) > + > + /* Simplify powi(0.0, x) into 0.0. */ > + (if (! HONOR_NANS (type) && ! HONOR_SIGNED_ZEROS (type)) > + (simplify > + (POWI real_zerop@0 @1) > + @0))) > + > /* Simplify powi(x,y) * powi(z,y) -> powi(x*z,y). */ > (simplify > (mult (POWI:s @0 @1) (POWI:s @2 @1)) > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c > b/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c > new file mode 100644 > index 00000000000..d98bcb0827e > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c > @@ -0,0 +1,42 @@ > +/* { dg-do compile } */ > +/* { dg-options "-Ofast -fdump-tree-optimized -fexcess-precision=16" } */ > +/* { dg-add-options float16 } */ > +/* { dg-require-effective-target float16_runtime } */ > +/* { dg-require-effective-target c99_runtime } */ > + > +extern void link_error (void); > + > +#define POW1OVER(TYPE1, TYPE2, CTY, TY) \ > + void \ > + pow1over_##TY (TYPE1 x, TYPE2 y) \ > + { \ > + TYPE1 t1 = 1.0##CTY / x; \ > + TYPE1 t2 = __builtin_pow##TY (t1, y); \ > + TYPE2 t3 = -y; \ > + TYPE1 t4 = __builtin_pow##TY (x, t3); \ > + if (t2 != t4) \ > + link_error (); \ > + } \ > + > +#define POW0(TYPE1, TYPE2, CTY, TY) \ > + void \ > + pow0_##TY (TYPE2 x) \ > + { \ > + TYPE1 t1 = __builtin_pow##TY (0.0##CTY, x); \ > + if (t1 != 0.0##CTY) \ > + link_error (); \ > + } \ > + > +#define TEST_ALL(TYPE1, TYPE2, CTY, TY) \ > + POW1OVER (TYPE1, TYPE2, CTY, TY) \ > + POW0 (TYPE1, TYPE2, CTY, TY) > + > +TEST_ALL (double, double, , ) > +TEST_ALL (float, float, f, f) > +TEST_ALL (_Float16, _Float16, f16, f16) > +TEST_ALL (long double, long double, L, l) > +TEST_ALL (double, int, , i) > +TEST_ALL (float, int, f, if) > +TEST_ALL (long double, int, L, il) > + > +/* { dg-final { scan-tree-dump-not "link_error" "optimized" } } */ > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)