Ping.
https://gcc.gnu.org/ml/gcc-patches/2015-05/msg00071.html

Thanks,
Kyrill
On 01/05/15 17:02, Kyrill Tkachov wrote:
Hi all,

GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x, 
(int)k + 0.5)
using square roots. So, for the above examples it would generate sqrt (x) * 
sqrt (sqrt (x)),
sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it will 
calculate the
reciprocal of that).

However, the implementation of these optimisations is done on a bit of an 
ad-hoc basis with
the 0.25, 0.5, 0.75 cases hardcoded.
Judging by 
https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
these are the most commonly used exponents (at least in SPEC ;))

This patch generalises this optimisation into a (hopefully) more robust 
algorithm.
In particular, it expands calls to pow (x, CST) by expanding the integer part 
of CST
using a powi, like it does already, and then expanding the fractional part as a 
product
of repeated applications of a square root if the fractional part can be 
expressed
as a multiple of a power of 0.5.

I try to explain the algorithm in more detail in the comments in the patch but, 
for example:

pow (x, 5.625) is not currently handled, but with this patch will be expanded
to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 + 0.5 + 
0.5**3

Negative exponents are handled in either of two ways, depending on the exponent 
value:
* Using a simple reciprocal.
    For example:
    pow (x, -5.625) == 1.0 / pow (x, 5.625)
      --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))))

* For pow (x, EXP) with negative exponent EXP with integer part INT and 
fractional part FRAC:
pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
    For example:
    pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
      --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))


Since hardware square root instructions tend to be expensive, we may want to 
reduce the number
of square roots we are willing to calculate. Since we reuse intermediate square 
root results,
this boils down to restricting the depth of the square root chains. In all the 
examples above
that depth is 3. I've made this maximum depth parametrisable in params.def. By 
adjusting that
parameter we can adjust the resolution of this optimisation. So, if it's set to 
'4' then we
will synthesize every exponent that is a multiple of 0.5**4 == 0.0625, 
including negative
multiples. Currently, GCC will not try to expand negative multiples of anything 
else than 0.5

I have tried to keep the existing functionality intact and activate this only 
for
-funsafe-math-optimizations and only when the target has a sqrt instruction.
   An exception to that is pow (x, 0.5) which we prefer to transform to sqrt 
even
when a hardware sqrt is not available, presumably because the library function 
for
sqrt is usually faster than pow (?).


Having seen the glibc implementation of a fully IEEE-754-compliant pow 
function, I think we
would prefer synthesising the pow call whenever we can for -ffast-math.

I have seen this optimisation trigger a few times in SPEC2k6, in particular in 
447.dealII
and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) and pow 
(x, 0.875)
with square roots, multiplies and, in the case of -0.25, divides.
On 481.wrf I saw it remove a total of 22 out of 322 calls to pow

On 481.wrf on aarch64 I saw about a 1% improvement.
The cycle count on x86_64 was also smaller, but not by a significant amount 
(the same calls to
pow were eliminated).

In general, I think this can shine if multiple expandable calls to pow appear 
together.
So, for example for code:
double
baz (double a)
{
    return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow 
(a, 3.375);
}

we can generate:
baz:
          fsqrt   d3, d0
          fmul    d4, d0, d0
          fmov    d5, 1.0e+0
          fmul    d6, d0, d4
          fsqrt   d2, d3
          fmul    d1, d0, d2
          fsqrt   d0, d2
          fmul    d3, d3, d2
          fdiv    d1, d5, d1
          fmul    d3, d3, d6
          fmul    d2, d2, d0
          fmadd   d0, d4, d3, d1
          fmsub   d0, d6, d2, d0
          ret

reusing the sqrt results and doing more optimisations rather than the current:
baz:
          stp     x29, x30, [sp, -48]!
          fmov    d1, -1.25e+0
          add     x29, sp, 0
          stp     d8, d9, [sp, 16]
          fmov    d9, d0
          str     d10, [sp, 32]
          bl      pow
          fmov    d8, d0
          fmov    d0, d9
          fmov    d1, 5.75e+0
          bl      pow
          fmov    d10, d0
          fmov    d0, d9
          fmov    d1, 3.375e+0
          bl      pow
          fadd    d8, d8, d10
          ldr     d10, [sp, 32]
          fsub    d0, d8, d0
          ldp     d8, d9, [sp, 16]
          ldp     x29, x30, [sp], 48
          ret


Of course gcc could already do that if the exponents were to fall in the set 
{0.25, 0.75, k+0.5}
but with this patch that set can be greatly expanded.

I suppose if we're really lucky we might even open up new vectorisation 
opportunities.
For example:
void
vecfoo (float *a, float *b)
{
    for (int i = 0; i < N; i++)
      a[i] = __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i], 3.625);
}

will now be vectorisable if we have a hardware vector sqrt instruction. Though 
I'm not sure
how often this would appear in real code.


I would like advice on some implementation details:
- The new function representable_as_half_series_p is used to check if the 
fractional
part of an exponent can be represented as a multiple of a power of 0.5. It does 
so
by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there is a 
smarter
way of doing this, considering that REAL_VALUE_TYPE holds the bit 
representation of the
floating point number?

- Are there any correctness cases that I may have missed? This patch gates the 
optimisation
on -funsafe-math-optimizations, but maybe there are some edge cases that I 
missed?

- What should be the default value of the max-pow-sqrt-depth param? In this 
patch it's 5 which
on second thought strikes me as a bit aggressive. To catch exponents that are 
multiples of
0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I suppose 
it depends on
how likely more fine-grained powers are to appear in real code, how expensive 
the C library
implementation of pow is, and how expensive are the sqrt instructions in 
hardware.


Bootstrapped and tested on x86_64, aarch64, arm (pending)
SPEC2k6 built and ran fine. Can anyone suggest anything other codebase that 
might
be of interest to look at?

Thanks,
Kyrill

2015-05-01  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

      * params.def (PARAM_MAX_POW_SQRT_DEPTH): New param.
      * tree-ssa-math-opts.c: Include params.h
      (pow_synth_sqrt_info): New struct.
      (representable_as_half_series_p): New function.
      (get_fn_chain): Likewise.
      (print_nested_fn): Likewise.
      (dump_fractional_sqrt_sequence): Likewise.
      (dump_integer_part): Likewise.
      (expand_pow_as_sqrts): Likewise.
      (gimple_expand_builtin_pow): Use above to attempt to expand
      pow as series of square roots.  Removed now unused variables.

2015-05-01  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

      * gcc.dg/pow-sqrt.x: New file.
      * gcc.dg/pow-sqrt-1.c: New test.
      * gcc.dg/pow-sqrt-2.c: Likewise.
      * gcc.dg/pow-sqrt-3.c: Likewise.

Reply via email to