Hi all,
GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x,
(int)k + 0.5)
using square roots. So, for the above examples it would generate sqrt (x) *
sqrt (sqrt (x)),
sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it will
calculate the
reciprocal of that).
However, the implementation of these optimisations is done on a bit of an
ad-hoc basis with
the 0.25, 0.5, 0.75 cases hardcoded.
Judging by
https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
these are the most commonly used exponents (at least in SPEC ;))
This patch generalises this optimisation into a (hopefully) more robust
algorithm.
In particular, it expands calls to pow (x, CST) by expanding the integer part
of CST
using a powi, like it does already, and then expanding the fractional part as a
product
of repeated applications of a square root if the fractional part can be
expressed
as a multiple of a power of 0.5.
I try to explain the algorithm in more detail in the comments in the patch but,
for example:
pow (x, 5.625) is not currently handled, but with this patch will be expanded
to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 + 0.5 +
0.5**3
Negative exponents are handled in either of two ways, depending on the exponent
value:
* Using a simple reciprocal.
For example:
pow (x, -5.625) == 1.0 / pow (x, 5.625)
--> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))))
* For pow (x, EXP) with negative exponent EXP with integer part INT and
fractional part FRAC:
pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
For example:
pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
--> sqrt (sqrt (sqrt (x))) / (powi (x, 6))
Since hardware square root instructions tend to be expensive, we may want to
reduce the number
of square roots we are willing to calculate. Since we reuse intermediate square
root results,
this boils down to restricting the depth of the square root chains. In all the
examples above
that depth is 3. I've made this maximum depth parametrisable in params.def. By
adjusting that
parameter we can adjust the resolution of this optimisation. So, if it's set to
'4' then we
will synthesize every exponent that is a multiple of 0.5**4 == 0.0625,
including negative
multiples. Currently, GCC will not try to expand negative multiples of anything
else than 0.5
I have tried to keep the existing functionality intact and activate this only
for
-funsafe-math-optimizations and only when the target has a sqrt instruction.
An exception to that is pow (x, 0.5) which we prefer to transform to sqrt
even
when a hardware sqrt is not available, presumably because the library function
for
sqrt is usually faster than pow (?).
Having seen the glibc implementation of a fully IEEE-754-compliant pow
function, I think we
would prefer synthesising the pow call whenever we can for -ffast-math.
I have seen this optimisation trigger a few times in SPEC2k6, in particular in
447.dealII
and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) and pow
(x, 0.875)
with square roots, multiplies and, in the case of -0.25, divides.
On 481.wrf I saw it remove a total of 22 out of 322 calls to pow
On 481.wrf on aarch64 I saw about a 1% improvement.
The cycle count on x86_64 was also smaller, but not by a significant amount
(the same calls to
pow were eliminated).
In general, I think this can shine if multiple expandable calls to pow appear
together.
So, for example for code:
double
baz (double a)
{
return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow
(a, 3.375);
}
we can generate:
baz:
fsqrt d3, d0
fmul d4, d0, d0
fmov d5, 1.0e+0
fmul d6, d0, d4
fsqrt d2, d3
fmul d1, d0, d2
fsqrt d0, d2
fmul d3, d3, d2
fdiv d1, d5, d1
fmul d3, d3, d6
fmul d2, d2, d0
fmadd d0, d4, d3, d1
fmsub d0, d6, d2, d0
ret
reusing the sqrt results and doing more optimisations rather than the current:
baz:
stp x29, x30, [sp, -48]!
fmov d1, -1.25e+0
add x29, sp, 0
stp d8, d9, [sp, 16]
fmov d9, d0
str d10, [sp, 32]
bl pow
fmov d8, d0
fmov d0, d9
fmov d1, 5.75e+0
bl pow
fmov d10, d0
fmov d0, d9
fmov d1, 3.375e+0
bl pow
fadd d8, d8, d10
ldr d10, [sp, 32]
fsub d0, d8, d0
ldp d8, d9, [sp, 16]
ldp x29, x30, [sp], 48
ret
Of course gcc could already do that if the exponents were to fall in the set
{0.25, 0.75, k+0.5}
but with this patch that set can be greatly expanded.
I suppose if we're really lucky we might even open up new vectorisation
opportunities.
For example:
void
vecfoo (float *a, float *b)
{
for (int i = 0; i < N; i++)
a[i] = __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i], 3.625);
}
will now be vectorisable if we have a hardware vector sqrt instruction. Though
I'm not sure
how often this would appear in real code.
I would like advice on some implementation details:
- The new function representable_as_half_series_p is used to check if the
fractional
part of an exponent can be represented as a multiple of a power of 0.5. It does
so
by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there is a
smarter
way of doing this, considering that REAL_VALUE_TYPE holds the bit
representation of the
floating point number?
- Are there any correctness cases that I may have missed? This patch gates the
optimisation
on -funsafe-math-optimizations, but maybe there are some edge cases that I
missed?
- What should be the default value of the max-pow-sqrt-depth param? In this
patch it's 5 which
on second thought strikes me as a bit aggressive. To catch exponents that are
multiples of
0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I suppose
it depends on
how likely more fine-grained powers are to appear in real code, how expensive
the C library
implementation of pow is, and how expensive are the sqrt instructions in
hardware.
Bootstrapped and tested on x86_64, aarch64, arm (pending)
SPEC2k6 built and ran fine. Can anyone suggest anything other codebase that
might
be of interest to look at?
Thanks,
Kyrill
2015-05-01 Kyrylo Tkachov <kyrylo.tkac...@arm.com>
* params.def (PARAM_MAX_POW_SQRT_DEPTH): New param.
* tree-ssa-math-opts.c: Include params.h
(pow_synth_sqrt_info): New struct.
(representable_as_half_series_p): New function.
(get_fn_chain): Likewise.
(print_nested_fn): Likewise.
(dump_fractional_sqrt_sequence): Likewise.
(dump_integer_part): Likewise.
(expand_pow_as_sqrts): Likewise.
(gimple_expand_builtin_pow): Use above to attempt to expand
pow as series of square roots. Removed now unused variables.
2015-05-01 Kyrylo Tkachov <kyrylo.tkac...@arm.com>
* gcc.dg/pow-sqrt.x: New file.
* gcc.dg/pow-sqrt-1.c: New test.
* gcc.dg/pow-sqrt-2.c: Likewise.
* gcc.dg/pow-sqrt-3.c: Likewise.