Hi, Venkat.

Since x^(1/2) = x * x^(-1/2), the Newton series can also be used for the regular square root at the cost of one extra multiplication, as is done on x86. That is what I was trying to estimate in the results quoted below.
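To make the identity above concrete, here is a minimal sketch in portable C. It is not the patch: rsqrt_estimate() is a hypothetical stand-in for the hardware estimate instruction (it uses a well-known bit-level initial guess so that the code runs anywhere), and the function names and the caller-chosen step count are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the hardware estimate (e.g. frsqrte).
   The bit-level guess is used here only for portability. */
float rsqrt_estimate(float x)
{
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* One Newton step for y ~ 1/sqrt(x): y' = y * (3 - x*y*y) / 2. */
float rsqrt_step(float x, float y)
{
    return y * (3.0f - x * y * y) * 0.5f;
}

/* Estimate followed by the requested number of Newton steps. */
float rsqrt_newton(float x, int steps)
{
    assert(x > 0.0f);   /* the sketch only covers positive inputs */

    float y = rsqrt_estimate(x);
    for (int i = 0; i < steps; i++)
        y = rsqrt_step(x, y);
    return y;
}

/* sqrt(x) = x * (1/sqrt(x)): the series plus one extra multiplication. */
float sqrt_via_rsqrt(float x, int steps)
{
    return x * rsqrt_newton(x, steps);
}
```

For example, sqrt_via_rsqrt(4.0f, 3) should come out very close to 2.0f. The sketch rejects x <= 0 because x = 0 would give 0 * inf = NaN, so a real implementation would need to handle that case separately.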
Cheers,

--
Evandro Menezes        Austin, TX

> -----Original Message-----
> From: Kumar, Venkataramanan [mailto:venkataramanan.ku...@amd.com]
> Sent: Monday, July 20, 2015 2:53
> To: Evandro Menezes; pins...@gmail.com; 'Dr. Philipp Tomsich'
> Cc: 'James Greenhalgh'; 'Benedikt Huber'; gcc-patches@gcc.gnu.org;
> 'Marcus Shawcroft'; 'Ramana Radhakrishnan'; 'Richard Earnshaw'
> Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
>
> Hi,
>
> I missed your email and noticed it this week.
>
> What does column 2 test? Are you trying to implement square roots using
> reciprocal estimate and step?
>
> But reciprocal square root using reciprocal estimate and (2 for fp, 3 for
> dp) steps seems to be better than using fdiv and fsqrt in your case.
>
> Regards,
> Venkat.
>
> > -----Original Message-----
> > From: Evandro Menezes [mailto:e.mene...@samsung.com]
> > Sent: Wednesday, July 15, 2015 3:45 AM
> > To: Kumar, Venkataramanan; pins...@gmail.com; 'Dr. Philipp Tomsich'
> > Cc: 'James Greenhalgh'; 'Benedikt Huber'; gcc-patches@gcc.gnu.org;
> > 'Marcus Shawcroft'; 'Ramana Radhakrishnan'; 'Richard Earnshaw'
> > Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root
> > (rsqrt) estimation in -ffast-math
> >
> > I ran a simple test on A57 rev. 0, looping a million times around
> > sqrt{,f} and the respective series iterations with the values in the
> > sequence 1..1000000 and got these results:
> >
> > sqrt(x):    36593844/s    1/sqrt(x):   18283875/s
> > 3 Steps:    47922557/s    3 Steps:     49005194/s
> >
> > sqrtf(x):  143988480/s    1/sqrtf(x):  69516857/s
> > 2 Steps:    78740157/s    2 Steps:     80385852/s
> >
> > I'm a bit surprised that the 3-iteration series for DP is faster than
> > sqrt(), but not that it's much faster for the reciprocal of sqrt().
> > As for SP, the 2-iteration series is faster only for the reciprocal of
> > sqrtf().
> >
> > There might still be some leg for this patch in real-world cases which
> > I'd like to investigate.
> >
> > --
> > Evandro Menezes        Austin, TX
> >
> >
> > > -----Original Message-----
> > > From: gcc-patches-ow...@gcc.gnu.org
> > > [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Kumar,
> > > Venkataramanan
> > > Sent: Monday, June 29, 2015 13:50
> > > To: pins...@gmail.com; Dr. Philipp Tomsich
> > > Cc: James Greenhalgh; Benedikt Huber; gcc-patches@gcc.gnu.org;
> > > Marcus Shawcroft; Ramana Radhakrishnan; Richard Earnshaw
> > > Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root
> > > (rsqrt) estimation in -ffast-math
> > >
> > > Hi,
> > >
> > > > -----Original Message-----
> > > > From: pins...@gmail.com [mailto:pins...@gmail.com]
> > > > Sent: Monday, June 29, 2015 10:23 PM
> > > > To: Dr. Philipp Tomsich
> > > > Cc: James Greenhalgh; Kumar, Venkataramanan; Benedikt Huber;
> > > > gcc-patc...@gcc.gnu.org; Marcus Shawcroft; Ramana Radhakrishnan;
> > > > Richard Earnshaw
> > > > Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> > > > (rsqrt) estimation in -ffast-math
> > > >
> > > > > On Jun 29, 2015, at 4:44 AM, Dr. Philipp Tomsich
> > > > > <philipp.toms...@theobroma-systems.com> wrote:
> > > > >
> > > > > James,
> > > > >
> > > > >> On 29 Jun 2015, at 13:36, James Greenhalgh
> > > > >> <james.greenha...@arm.com> wrote:
> > > > >>
> > > > >>> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar,
> > > > >>> Venkataramanan wrote:
> > > > >>>
> > > > >>>> -----Original Message-----
> > > > >>>> From: Dr. Philipp Tomsich
> > > > >>>> [mailto:philipp.toms...@theobroma-systems.com]
> > > > >>>> Sent: Monday, June 29, 2015 2:17 PM
> > > > >>>> To: Kumar, Venkataramanan
> > > > >>>> Cc: pins...@gmail.com; Benedikt Huber;
> > > > >>>> gcc-patches@gcc.gnu.org
> > > > >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square
> > > > >>>> root (rsqrt) estimation in -ffast-math
> > > > >>>>
> > > > >>>> Kumar,
> > > > >>>>
> > > > >>>> This does not come unexpected, as the initial estimation and
> > > > >>>> each iteration will add an architecturally-defined number of
> > > > >>>> bits of precision (ARMv8 guarantees only a minimum number of
> > > > >>>> bits provided per operation… the exact number is specific to
> > > > >>>> each micro-arch, though).
> > > > >>>> Depending on your architecture and on the required number of
> > > > >>>> precise bits by any given benchmark, one may see miscompares.
> > > > >>>
> > > > >>> True.
> > > > >>
> > > > >> I would be very uncomfortable with this approach.
> > > > >
> > > > > Same here. The default must be safe. Always.
> > > > > Unlike other architectures, we don’t have a problem with making
> > > > > the proper defaults for “safety”, as the ARMv8 ISA guarantees a
> > > > > minimum number of precise bits per iteration.
> > > > >
> > > > >> From Richard Biener's post in the thread Michael Matz linked
> > > > >> earlier in the thread:
> > > > >>
> > > > >>   It would follow existing practice of things we allow in
> > > > >>   -funsafe-math-optimizations. Existing practice in that we
> > > > >>   want to allow -ffast-math use with common benchmarks we care
> > > > >>   about.
> > > > >>
> > > > >>   https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> > > > >>
> > > > >> With the solution you seem to be converging on (2-steps for
> > > > >> some microarchitectures, 3 for others), a binary generated for
> > > > >> one micro-arch may drop below a minimum guarantee of precision
> > > > >> when run on another. This seems to go against the spirit of the
> > > > >> practice above. I would only support adding this optimization
> > > > >> to -Ofast if we could keep to architectural guarantees of
> > > > >> precision in the generated code (i.e. 3-steps everywhere).
> > > > >>
> > > > >> I don't object to adding a "-mlow-precision-recip-sqrt" style
> > > > >> option, which would be off by default, would enable the 2-step
> > > > >> mode, and would need to be explicitly enabled (i.e. not implied
> > > > >> by -mcpu=foo) but I don't see what this buys you beyond the
> > > > >> Gromacs boost (and even there you would be creating an Invalid
> > > > >> Run as optimization flags must be applied across all workloads).
> > > > >
> > > > > Any flag that reduces precision (and thus breaks IEEE
> > > > > floating-point semantics) needs to be gated with an “unsafe”
> > > > > flag (i.e. one that is never on by default).
> > > > > As a consequence, the “peak”-tuning for SPEC will turn this on…
> > > > > but barely anyone else would.
> > > > >
> > > > >> For the 3-step optimization, it is clear to me that for "generic"
> > > > >> tuning we don't want this to be enabled by default; experimental
> > > > >> results and advice in this thread argue against it for
> > > > >> thunderx and cortex-a57 targets.
> > > > >> However, enabling it based on the CPU tuning selected seems
> > > > >> fine to me.
> > > > >
> > > > > I do not agree on this one, as I would like to see the safe form
> > > > > (i.e. 3 and 5 iterations respectively) become the default. Most
> > > > > “server-type” chips should not see a performance regression,
> > > > > while it will be easier to optimise for this in hardware than
> > > > > for a (potentially microcoded) sqrt-instruction (and subsequent,
> > > > > dependent divide).
> > > > >
> > > > > I have not heard anyone claim a performance regression (either
> > > > > on thunderx or on cortex-a57), but merely heard a “no speed-up”.
> > > >
> > > > Actually it does regress performance on thunderX, I just assumed
> > > > that when I said not going to be a win it was taken as a slow down.
> > > > It regresses gromacs by more than 10% on thunderX but I can't
> > > > remember how much as I had someone else run it. The latency
> > > > difference is also over 40%; for example single precision: 29
> > > > cycles with div (12) + sqrt (17) directly vs 42 cycles with the
> > > > rsqrte and 2 iterations of 2mul/rsqrts (double is 53 vs 60). That
> > > > is a huge difference right there. ThunderX has a fast div and a
> > > > fast sqrt for 32bit and a reasonable one for double. So again this
> > > > is not just not a win but rather a regression for thunderX. I
> > > > suspect the same is true for cortex-a57.
> > > >
> > > > Thanks,
> > > > Andrew
> > > >
> > >
> > > Yes, theoretically that should be true for the cortex-a57 case as
> > > well. But I believe hardware pipelining together with instruction
> > > scheduling in the compiler helps a little in the gromacs case: ~3%
> > > to 4% with the original patch.
> > >
> > > I have not tested other FP benchmarks. As James said, a flag such as
> > > -mlow-precision-recip-sqrt, if allowed, can be used as a peak flag.
> > >
> > > > >
> > > > > So I am strongly in favor of defaulting to the ‘safe’ number of
> > > > > iterations, even when compiling for a generic target.
> > > > >
> > > > > Best,
> > > > > Philipp.
> > > > >
> > >
> > > Regards,
> > > Venkat.