For both FRECPE and FRSQRTE, the ARMv8 ISA guide states in the pseudo-code:
"Result is double-precision and a multiple of 1/256 in the range 1 to 511/256."

This suggests that the estimate carries only 8 bits of precision. IIRC, x86
returns 12 bits for its equivalent insns, which then requires a single series
iteration for both SP and DP to achieve a precise enough result.

--
Evandro Menezes
Austin, TX

> -----Original Message-----
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On
> Behalf Of Dr. Philipp Tomsich
> Sent: Monday, June 29, 2015 3:47
> To: Kumar, Venkataramanan
> Cc: pins...@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
>
> Kumar,
>
> This does not come unexpected, as the initial estimation and each iteration
> will add an architecturally-defined number of bits of precision (ARMv8
> guarantees only a minimum number of bits provided per operation… the exact
> number is specific to each micro-arch, though).
> Depending on your architecture and on the required number of precise bits by
> any given benchmark, one may see miscompares.
>
> Do you know the exact number of bits that the initial estimate and the
> subsequent refinement steps add for your micro-arch?
>
> Thanks,
> Philipp.
>
> > On 29 Jun 2015, at 10:17, Kumar, Venkataramanan
> > <venkataramanan.ku...@amd.com> wrote:
> >
> > Hmm, reducing the iterations to "1 step for float" and "2 steps for
> > double", I got VE (miscompares) on the following benchmarks:
> > 416.gamess
> > 453.povray
> > 454.calculix
> > 459.GemsFDTD
> >
> > Benedikt, I have an ICE for 444.namd with your patch; not sure if
> > something is wrong in my local tree.
> >
> > Regards,
> > Venkat.
> >
> >> -----Original Message-----
> >> From: pins...@gmail.com [mailto:pins...@gmail.com]
> >> Sent: Sunday, June 28, 2015 8:35 PM
> >> To: Kumar, Venkataramanan
> >> Cc: Dr.
> >> Philipp Tomsich; Benedikt Huber; gcc-patches@gcc.gnu.org
> >> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >> (rsqrt) estimation in -ffast-math
> >>
> >>> On Jun 25, 2015, at 9:44 AM, Kumar, Venkataramanan
> >>> <venkataramanan.ku...@amd.com> wrote:
> >>>
> >>> I got around ~12% gain with -Ofast -mcpu=cortex-a57.
> >>
> >> I get around 11/12% on thunderX with the patch and the decreasing the
> >> iterations change (1/2) compared to without the patch.
> >>
> >> Thanks,
> >> Andrew
> >>
> >>>
> >>> Regards,
> >>> Venkat.
> >>>
> >>>> -----Original Message-----
> >>>> From: gcc-patches-ow...@gcc.gnu.org
> >>>> [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Dr. Philipp Tomsich
> >>>> Sent: Thursday, June 25, 2015 9:13 PM
> >>>> To: Kumar, Venkataramanan
> >>>> Cc: Benedikt Huber; pins...@gmail.com; gcc-patches@gcc.gnu.org
> >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>>> (rsqrt) estimation in -ffast-math
> >>>>
> >>>> Kumar,
> >>>>
> >>>> what is the relative gain that you see on Cortex-A57?
> >>>>
> >>>> Thanks,
> >>>> Philipp.
> >>>>
> >>>>> On 25 Jun 2015, at 17:35, Kumar, Venkataramanan
> >>>>> <venkataramanan.ku...@amd.com> wrote:
> >>>>>
> >>>>> Changing to "1 step for float" and "2 steps for double" gives
> >>>>> better gains now for gromacs on cortex-a57.
> >>>>>
> >>>>> Regards,
> >>>>> Venkat.
> >>>>>> -----Original Message-----
> >>>>>> From: gcc-patches-ow...@gcc.gnu.org
> >>>>>> [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Benedikt Huber
> >>>>>> Sent: Thursday, June 25, 2015 4:09 PM
> >>>>>> To: pins...@gmail.com
> >>>>>> Cc: gcc-patches@gcc.gnu.org; philipp.tomsich@theobroma-systems.com
> >>>>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>>>>> (rsqrt) estimation in -ffast-math
> >>>>>>
> >>>>>> Andrew,
> >>>>>>
> >>>>>>> This is NOT a win on thunderX at least for single precision
> >>>>>>> because you have to do the divide and sqrt in the same time as it
> >>>>>>> takes 5 multiplies (estimate and step are multiplies in the
> >>>>>>> thunderX pipeline). Doubles is 10 multiplies which is just the
> >>>>>>> same as what the patch does (but it is really slightly less than
> >>>>>>> 10, I rounded up). So in the end this is NOT a win at all for
> >>>>>>> thunderX unless we do one less step for both single and double.
> >>>>>>
> >>>>>> Yes, the expected benefit from rsqrt estimation is implementation
> >>>>>> specific. If one has a better initial rsqrte or an application
> >>>>>> that can trade precision for execution time, we could offer a
> >>>>>> command line option to do only 2 steps for double and 1 step for
> >>>>>> float; similar to -mrecip-precision for PowerPC.
> >>>>>> What are your thoughts on that?
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Benedikt