On 31 July 2015 at 10:34, Ramana Radhakrishnan <ramana.radhakrish...@foss.arm.com> wrote:
> I've tried this in the past and never been convinced that 2 iterations are
> enough to get to stability with this given that the results are only precise
> for 8 bits / iteration. Thus I've always believed you need 3 iterations
> rather than 2 at which point I've never been sure that it's worth it. So the
> testing that you've done with this currently is not enough for this to go
> into the tree.

My understanding is that 2 iterations are sufficient for single precision
floating point (although not for double precision), because each iteration
of Newton-Raphson doubles the number of bits of accuracy. I haven't worked
through the maths myself, but
https://en.wikipedia.org/wiki/Division_algorithm#Newton.E2.80.93Raphson_division
says:

"This squaring of the error at each iteration step — the so-called
quadratic convergence of Newton–Raphson's method — has the effect that the
number of correct digits in the result roughly doubles for every iteration,
a property that becomes extremely valuable when the numbers involved have
many digits"

Therefore:

vrecpe        -> 8 bits of accuracy
+1 iteration  -> 16 bits of accuracy
+2 iterations -> 32 bits of accuracy (but in reality limited to the
                 precision of a 32-bit float)

Since 32 bits is much more accuracy than the 24 bits of precision in a
single precision FP value, 2 iterations should be sufficient.

> I'd like this to be tested on a couple of different AArch32 implementations
> with a wider range of inputs to verify that the results are acceptable as
> well as running something like SPEC2k(6) with at least one iteration to ensure
> correctness.

I can't argue with confirming theory matches practice :)

Some corner cases (e.g. numbers around FLT_MAX, FLT_MIN etc.) may result in
denormals or out-of-range values during the reciprocal calculation, which
could result in answers that are less accurate than the typical case, but
I think that is acceptable with -ffast-math.

Charles
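
For illustration, the convergence argument above can be sketched in portable C. This is only a rough model under stated assumptions, not the code from the patch: rough_recip() is a hypothetical stand-in for vrecpe that truncates the exact reciprocal to 8 mantissa bits, and nr_step() uses the textbook update x * (2 - d * x). The real instructions differ in rounding and special-case handling, so results on hardware would still need the testing discussed above.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for vrecpe (NOT the real instruction): keep the
   sign, the exponent and the top 8 mantissa bits of the exact reciprocal,
   so the relative error of the estimate is below 2^-8. */
float rough_recip(float d)
{
    float r = 1.0f / d;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF8000u;            /* clear the low 15 mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* One Newton-Raphson step for 1/d: a relative error e becomes roughly e^2. */
float nr_step(float d, float x)
{
    return x * (2.0f - d * x);
}

/* Rough estimate followed by the requested number of refinement steps. */
float recip_nr(float d, int iterations)
{
    float x = rough_recip(d);
    for (int i = 0; i < iterations; i++)
        x = nr_step(d, x);
    return x;
}

/* Relative error against a double-precision reference reciprocal. */
double rel_error(float d, int iterations)
{
    double exact = 1.0 / (double)d;
    return fabs((double)recip_nr(d, iterations) - exact) / fabs(exact);
}
```

Calling rel_error(3.0f, n) for n = 0, 1, 2 is one way to watch the error shrink: in this model it starts below 2^-8, drops below 2^-15 after one step, and after two steps is dominated by single-precision rounding rather than by the iteration itself. Inputs near FLT_MAX or FLT_MIN are not handled specially here.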