On Fri, May 8, 2015 at 5:09 PM, Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> wrote: > > On 08/05/15 14:56, Kyrill Tkachov wrote: >> >> On 08/05/15 11:18, Richard Biener wrote: >>> >>> On Fri, May 1, 2015 at 6:02 PM, Kyrill Tkachov >>> <kyrylo.tkac...@foss.arm.com> wrote: >>>> >>>> Hi all, >>>> >>>> GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow >>>> (x, >>>> (int)k + 0.5) >>>> using square roots. So, for the above examples it would generate sqrt >>>> (x) * >>>> sqrt (sqrt (x)), >>>> sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it >>>> will calculate the >>>> reciprocal of that). >>>> >>>> However, the implementation of these optimisations is done on a bit of >>>> an >>>> ad-hoc basis with >>>> the 0.25, 0.5, 0.75 cases hardcoded. >>>> Judging by >>>> >>>> https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf >>>> these are the most commonly used exponents (at least in SPEC ;)) >>>> >>>> This patch generalises this optimisation into a (hopefully) more robust >>>> algorithm. >>>> In particular, it expands calls to pow (x, CST) by expanding the integer >>>> part of CST >>>> using a powi, like it does already, and then expanding the fractional >>>> part >>>> as a product >>>> of repeated applications of a square root if the fractional part can be >>>> expressed >>>> as a multiple of a power of 0.5. >>>> >>>> I try to explain the algorithm in more detail in the comments in the >>>> patch >>>> but, for example: >>>> >>>> pow (x, 5.625) is not currently handled, but with this patch will be >>>> expanded >>>> to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 >>>> + >>>> 0.5 + 0.5**3 >>>> >>>> Negative exponents are handled in either of two ways, depending on the >>>> exponent value: >>>> * Using a simple reciprocal. >>>> For example: >>>> pow (x, -5.625) == 1.0 / pow (x, 5.625) >>>> --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x)))) >>>> >>>> * For pow (x, EXP) with negative exponent EXP with integer part INT and >>>> fractional part FRAC: >>>> pow (1.0 - FRAC) / powi (ceil (abs (EXP))). >>>> For example: >>>> pow (x, -5.875) == pow (x, 0.125) / powi (X, 6) >>>> --> sqrt (sqrt (sqrt (x))) / (powi (x, 6)) >>>> >>>> >>>> Since hardware square root instructions tend to be expensive, we may >>>> want to >>>> reduce the number >>>> of square roots we are willing to calculate. Since we reuse intermediate >>>> square root results, >>>> this boils down to restricting the depth of the square root chains. In >>>> all >>>> the examples above >>>> that depth is 3. I've made this maximum depth parametrisable in >>>> params.def. >>>> By adjusting that >>>> parameter we can adjust the resolution of this optimisation. So, if it's >>>> set >>>> to '4' then we >>>> will synthesize every exponent that is a multiple of 0.5**4 == 0.0625, >>>> including negative >>>> multiples. Currently, GCC will not try to expand negative multiples of >>>> anything else than 0.5 >>>> >>>> I have tried to keep the existing functionality intact and activate this >>>> only for >>>> -funsafe-math-optimizations and only when the target has a sqrt >>>> instruction. >>>> An exception to that is pow (x, 0.5) which we prefer to transform to >>>> sqrt >>>> even >>>> when a hardware sqrt is not available, presumably because the library >>>> function for >>>> sqrt is usually faster than pow (?). >>> >>> Yes. It's also a safe transform - which you seem to put under >>> flag_unsafe_math_optimizations only with your patch. >>> >>> It would be clearer to just leave the special-case >>> >>> - /* Optimize pow(x,0.5) = sqrt(x). This replacement is always safe >>> - unless signed zeros must be maintained. pow(-0,0.5) = +0, while >>> - sqrt(-0) = -0. */ >>> - if (sqrtfn >>> - && REAL_VALUES_EQUAL (c, dconsthalf) >>> - && !HONOR_SIGNED_ZEROS (mode)) >>> - return build_and_insert_call (gsi, loc, sqrtfn, arg0); >>> >>> in as-is. >> >> Ok, I'll leave that case explicit. >> >>> You also removed the Os constraint which you should put back in. >>> Basically if !optimize_function_for_speed_p then generate at most >>> two calls to sqrt (iff the HW has a sqrt instruction). >> >> I tried to move that logic into expand_with_sqrts but >> I'll move it outside it. It seems that this boils down to >> only 0.25, as any other 2xsqrt chain will also involve a >> multiply or a divide which we currently avoid. >> >>> You fail to add a testcase that checks that the optimization applies. >> >> I'll add one to scan the sincos dump. >> I notice that we don't have a testuite check that the target has >> a hw sqrt instructions. Would you like me to add one? Or can I make >> the testcase aarch64-specific?
Would be great to have a testsuite check for this. >> >>> Otherwise the idea looks good though there must be a better way >>> to compute the series than by using real-arithmetic and forcefully >>> trying out all possibilities... >> >> I get that feeling too. What I need is not only a way >> of figuring out if the fractional part of the exponent can be >> represented in this way, but also compute the depth of the >> sqrt chain and the number of multiplies... >> That being said, the current approach is O(maximum depth) and >> I don't expect the depth to go much beyond 3 or 4 in practice. >> >> Thanks for looking at it! >> I'll respin the patch. > > > And here it is, with my above comments implemented. > Bootstrapped on x86_64 and tested on aarch64. > Full testing on arm and aarch64 ongoing. > > Is this ok if testing comes clean? Ok. Thanks, Richard. > Thanks, > Kyrill > > >> Kyrill >> >>> Richard. >>> >>>> Having seen the glibc implementation of a fully IEEE-754-compliant pow >>>> function, I think we >>>> would prefer synthesising the pow call whenever we can for -ffast-math. >>>> >>>> I have seen this optimisation trigger a few times in SPEC2k6, in >>>> particular >>>> in 447.dealII >>>> and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) >>>> and >>>> pow (x, 0.875) >>>> with square roots, multiplies and, in the case of -0.25, divides. >>>> On 481.wrf I saw it remove a total of 22 out of 322 calls to pow >>>> >>>> On 481.wrf on aarch64 I saw about a 1% improvement. >>>> The cycle count on x86_64 was also smaller, but not by a significant >>>> amount >>>> (the same calls to >>>> pow were eliminated). >>>> >>>> In general, I think this can shine if multiple expandable calls to pow >>>> appear together. >>>> So, for example for code: >>>> double >>>> baz (double a) >>>> { >>>> return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - >>>> __builtin_pow >>>> (a, 3.375); >>>> } >>>> >>>> we can generate: >>>> baz: >>>> fsqrt d3, d0 >>>> fmul d4, d0, d0 >>>> fmov d5, 1.0e+0 >>>> fmul d6, d0, d4 >>>> fsqrt d2, d3 >>>> fmul d1, d0, d2 >>>> fsqrt d0, d2 >>>> fmul d3, d3, d2 >>>> fdiv d1, d5, d1 >>>> fmul d3, d3, d6 >>>> fmul d2, d2, d0 >>>> fmadd d0, d4, d3, d1 >>>> fmsub d0, d6, d2, d0 >>>> ret >>>> >>>> reusing the sqrt results and doing more optimisations rather than the >>>> current: >>>> baz: >>>> stp x29, x30, [sp, -48]! >>>> fmov d1, -1.25e+0 >>>> add x29, sp, 0 >>>> stp d8, d9, [sp, 16] >>>> fmov d9, d0 >>>> str d10, [sp, 32] >>>> bl pow >>>> fmov d8, d0 >>>> fmov d0, d9 >>>> fmov d1, 5.75e+0 >>>> bl pow >>>> fmov d10, d0 >>>> fmov d0, d9 >>>> fmov d1, 3.375e+0 >>>> bl pow >>>> fadd d8, d8, d10 >>>> ldr d10, [sp, 32] >>>> fsub d0, d8, d0 >>>> ldp d8, d9, [sp, 16] >>>> ldp x29, x30, [sp], 48 >>>> ret >>>> >>>> >>>> Of course gcc could already do that if the exponents were to fall in the >>>> set >>>> {0.25, 0.75, k+0.5} >>>> but with this patch that set can be greatly expanded. >>>> >>>> I suppose if we're really lucky we might even open up new vectorisation >>>> opportunities. >>>> For example: >>>> void >>>> vecfoo (float *a, float *b) >>>> { >>>> for (int i = 0; i < N; i++) >>>> a[i] = __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i], >>>> 3.625); >>>> } >>>> >>>> will now be vectorisable if we have a hardware vector sqrt instruction. >>>> Though I'm not sure >>>> how often this would appear in real code. >>>> >>>> >>>> I would like advice on some implementation details: >>>> - The new function representable_as_half_series_p is used to check if >>>> the >>>> fractional >>>> part of an exponent can be represented as a multiple of a power of 0.5. >>>> It >>>> does so >>>> by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there >>>> is a >>>> smarter >>>> way of doing this, considering that REAL_VALUE_TYPE holds the bit >>>> representation of the >>>> floating point number? >>>> >>>> - Are there any correctness cases that I may have missed? This patch >>>> gates >>>> the optimisation >>>> on -funsafe-math-optimizations, but maybe there are some edge cases that >>>> I >>>> missed? >>>> >>>> - What should be the default value of the max-pow-sqrt-depth param? In >>>> this >>>> patch it's 5 which >>>> on second thought strikes me as a bit aggressive. To catch exponents >>>> that >>>> are multiples of >>>> 0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I >>>> suppose it depends on >>>> how likely more fine-grained powers are to appear in real code, how >>>> expensive the C library >>>> implementation of pow is, and how expensive are the sqrt instructions in >>>> hardware. >>>> >>>> >>>> Bootstrapped and tested on x86_64, aarch64, arm (pending) >>>> SPEC2k6 built and ran fine. Can anyone suggest anything other codebase >>>> that >>>> might >>>> be of interest to look at? >>>> >>>> Thanks, >>>> Kyrill >>>> >>>> 2015-05-01 Kyrylo Tkachov <kyrylo.tkac...@arm.com> >>>> >>>> * params.def (PARAM_MAX_POW_SQRT_DEPTH): New param. >>>> * tree-ssa-math-opts.c: Include params.h >>>> (pow_synth_sqrt_info): New struct. >>>> (representable_as_half_series_p): New function. >>>> (get_fn_chain): Likewise. >>>> (print_nested_fn): Likewise. >>>> (dump_fractional_sqrt_sequence): Likewise. >>>> (dump_integer_part): Likewise. >>>> (expand_pow_as_sqrts): Likewise. >>>> (gimple_expand_builtin_pow): Use above to attempt to expand >>>> pow as series of square roots. Removed now unused variables. >>>> >>>> 2015-05-01 Kyrylo Tkachov <kyrylo.tkac...@arm.com> >>>> >>>> * gcc.dg/pow-sqrt.x: New file. >>>> * gcc.dg/pow-sqrt-1.c: New test. >>>> * gcc.dg/pow-sqrt-2.c: Likewise. >>>> * gcc.dg/pow-sqrt-3.c: Likewise. > >