Hi,

> -----Original Message-----
> From: pins...@gmail.com [mailto:pins...@gmail.com]
> Sent: Monday, June 29, 2015 10:23 PM
> To: Dr. Philipp Tomsich
> Cc: James Greenhalgh; Kumar, Venkataramanan; Benedikt Huber; gcc-patc...@gcc.gnu.org; Marcus Shawcroft; Ramana Radhakrishnan; Richard Earnshaw
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
>
> > On Jun 29, 2015, at 4:44 AM, Dr. Philipp Tomsich <philipp.toms...@theobroma-systems.com> wrote:
> >
> > James,
> >
> >> On 29 Jun 2015, at 13:36, James Greenhalgh <james.greenha...@arm.com> wrote:
> >>
> >>> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan wrote:
> >>>
> >>>> -----Original Message-----
> >>>> From: Dr. Philipp Tomsich [mailto:philipp.toms...@theobroma-systems.com]
> >>>> Sent: Monday, June 29, 2015 2:17 PM
> >>>> To: Kumar, Venkataramanan
> >>>> Cc: pins...@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
> >>>>
> >>>> Kumar,
> >>>>
> >>>> This is not unexpected, as the initial estimate and each iteration add an architecturally-defined number of bits of precision (ARMv8 guarantees only a minimum number of bits per operation… the exact number is specific to each micro-arch, though).
> >>>> Depending on your micro-architecture and on the number of precise bits a given benchmark requires, one may see miscompares.
> >>>
> >>> True.
> >>
> >> I would be very uncomfortable with this approach.
> >
> > Same here. The default must be safe. Always.
> > Unlike other architectures, we don’t have a problem with choosing the proper defaults for “safety”, as the ARMv8 ISA guarantees a minimum number of precise bits per iteration.
> >
> >> From Richard Biener's post in the thread Michael Matz linked earlier in the thread:
> >>
> >>   It would follow existing practice of things we allow in
> >>   -funsafe-math-optimizations. Existing practice in that we
> >>   want to allow -ffast-math use with common benchmarks we care
> >>   about.
> >>
> >>   https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> >>
> >> With the solution you seem to be converging on (2 steps for some micro-architectures, 3 for others), a binary generated for one micro-arch may drop below a minimum guarantee of precision when run on another. This goes against the spirit of the practice above. I would only support adding this optimization to -Ofast if we could keep to the architectural guarantees of precision in the generated code (i.e. 3 steps everywhere).
> >>
> >> I don't object to adding a "-mlow-precision-recip-sqrt" style option, which would be off by default, would enable the 2-step mode, and would need to be explicitly enabled (i.e. not implied by -mcpu=foo), but I don't see what this buys you beyond the Gromacs boost (and even there you would be creating an Invalid Run, as optimization flags must be applied across all workloads).
> >
> > Any flag that reduces precision (and thus breaks IEEE floating-point semantics) needs to be gated behind an “unsafe” flag (i.e. one that is never on by default).
> > As a consequence, the “peak” tuning for SPEC will turn this on… but barely anyone else would.
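To make the 2-step vs 3-step expansion under discussion concrete: the sequence is FRSQRTE for an initial estimate followed by FRSQRTS-based Newton-Raphson steps. The sketch below is only an illustration written with the ACLE NEON intrinsics (vrsqrteq_f32 / vrsqrtsq_f32); it is not the code the patch emits, and the function name is made up for the example.

  #include <arm_neon.h>
  #include <stdio.h>

  /* 2-step single-precision 1/sqrt(d): FRSQRTE gives a rough initial
     estimate, and each FRSQRTS-based Newton-Raphson step refines it,
     roughly doubling the number of correct bits (subject to the
     per-micro-arch minimum mentioned above).  */
  static float32x4_t
  rsqrt_2step (float32x4_t d)
  {
    float32x4_t x = vrsqrteq_f32 (d);                       /* initial estimate */
    x = vmulq_f32 (x, vrsqrtsq_f32 (vmulq_f32 (d, x), x));  /* refinement step 1 */
    x = vmulq_f32 (x, vrsqrtsq_f32 (vmulq_f32 (d, x), x));  /* refinement step 2 */
    /* A "3-step" variant would add one more line like the two above.  */
    return x;
  }

  int
  main (void)
  {
    float in[4] = { 1.0f, 2.0f, 4.0f, 10.0f };
    float out[4];
    vst1q_f32 (out, rsqrt_2step (vld1q_f32 (in)));
    for (int i = 0; i < 4; i++)
      printf ("rsqrt(%g) ~= %.7f\n", in[i], out[i]);
    return 0;
  }

A double-precision version has the same shape with the f64 intrinsics, just with more refinement steps, since the estimate has to be refined to more correct bits.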
> >
> >> For the 3-step optimization, it is clear to me that for "generic" tuning we don't want this enabled by default; the experimental results and advice in this thread argue against it for the thunderx and cortex-a57 targets.
> >> However, enabling it based on the CPU tuning selected seems fine to me.
> >
> > I do not agree on this one, as I would like to see the safe form (i.e. 3 and 5 iterations, respectively) become the default. Most “server-type” chips should not see a performance regression, and it will be easier to optimise for this in hardware than for a (potentially microcoded) sqrt instruction (and the subsequent, dependent divide).
> >
> > I have not heard anyone claim a performance regression (either on thunderx or on cortex-a57), merely a “no speed-up”.
>
> Actually, it does regress performance on ThunderX; I just assumed that when I said it was not going to be a win, that would be taken as a slowdown. It regresses gromacs by more than 10% on ThunderX, though I can't remember exactly how much, as I had someone else run it. The latency difference is also over 40%; for example, in single precision: 29 cycles with div (12) and sqrt (17) directly, vs 42 cycles with rsqrte and 2 iterations of 2 mul/rsqrts (double is 53 vs 60). That is a huge difference right there. ThunderX has a fast div and a fast sqrt for 32-bit, and a reasonable one for double. So again, this is not merely "not a win" but an outright regression for ThunderX. I suspect the same is true for cortex-a57.
>
> Thanks,
> Andrew
>
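Putting the numbers above side by side (just restating Andrew's figures, no new measurements):

  single precision:  fdiv (12) + fsqrt (17) = 29 cycles    vs.  frsqrte + 2 x (2 fmul + frsqrts) = 42 cycles
  double precision:  53 cycles (div/sqrt)                  vs.  60 cycles (estimate sequence)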
Yes, theoretically that should be true for the cortex-a57 case as well. But I believe hardware pipelining, together with instruction scheduling in the compiler, helps a little in the gromacs case: ~3% to 4% with the original patch. I have not tested other FP benchmarks. As James said, a -mlow-precision-recip-sqrt flag, if allowed, can be used as a peak flag.

> >
> > So I am strongly in favor of defaulting to the ‘safe’ number of iterations, even when compiling for a generic target.
> >
> > Best,
> > Philipp.
>

Regards,
Venkat.
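P.S. To be explicit about the "peak flag" usage I mean: if the proposed -mlow-precision-recip-sqrt option goes in as James describes (off by default, not implied by -mcpu), a peak-style build would have to request it explicitly, along the lines of:

  gcc -Ofast -mcpu=cortex-a57 -mlow-precision-recip-sqrt -o gromacs ...

(Illustrative command line only; the exact flag name and semantics are still being discussed.)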