James,

On 29 Jun 2015, at 13:36, James Greenhalgh <james.greenha...@arm.com> wrote:
>
> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan wrote:
>>
>>> -----Original Message-----
>>> From: Dr. Philipp Tomsich [mailto:philipp.toms...@theobroma-systems.com]
>>> Sent: Monday, June 29, 2015 2:17 PM
>>> To: Kumar, Venkataramanan
>>> Cc: pins...@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>>> estimation in -ffast-math
>>>
>>> Kumar,
>>>
>>> This does not come unexpected, as the initial estimation and each iteration
>>> will add an architecturally-defined number of bits of precision (ARMv8
>>> guarantees only a minimum number of bits provided per operation… the
>>> exact number is specific to each micro-arch, though).
>>> Depending on your architecture and on the required number of precise bits
>>> by any given benchmark, one may see miscompares.
>>
>> True.
>
> I would be very uncomfortable with this approach.

Same here. The default must be safe. Always.
Unlike other architectures, we don’t have a problem with making the proper
defaults for “safety”, as the ARMv8 ISA guarantees a minimum number of precise
bits per iteration.

> From Richard Biener's post in the thread Michael Matz linked earlier
> in the thread:
>
>    It would follow existing practice of things we allow in
>    -funsafe-math-optimizations. Existing practice in that we
>    want to allow -ffast-math use with common benchmarks we care
>    about.
>
>    https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
>
> With the solution you seem to be converging on (2-steps for some
> microarchitectures, 3 for others), a binary generated for one micro-arch
> may drop below a minimum guarantee of precision when run on another. This
> seems to go against the spirit of the practice above. I would only support
> adding this optimization to -Ofast if we could keep to architectural
> guarantees of precision in the generated code (i.e. 3-steps everywhere).
>
> I don't object to adding a "-mlow-precision-recip-sqrt" style option,
> which would be off by default, would enable the 2-step mode, and would
> need to be explicitly enabled (i.e. not implied by -mcpu=foo) but I don't
> see what this buys you beyond the Gromacs boost (and even there you would
> be creating an Invalid Run as optimization flags must be applied across
> all workloads).

Any flag that reduces precision (and thus breaks IEEE floating-point semantics)
needs to be gated with an “unsafe” flag (i.e. one that is never on by default).
As a consequence, the “peak”-tuning for SPEC will turn this on… but barely
anyone else would.

> For the 3-step optimization, it is clear to me that for "generic" tuning
> we don't want this to be enabled by default experimental results and advice
> in this thread argues against it for thunderx and cortex-a57 targets.
> However, enabling it based on the CPU tuning selected seems fine to me.

I do not agree on this one, as I would like to see the safe form (i.e. 3 and 5
iterations respectively) become the default. Most “server-type” chips should
not see a performance regression, while it will be easier to optimise for this
in hardware than for a (potentially microcoded) sqrt-instruction (and
subsequent, dependent divide).

I have not heard anyone claim a performance regression (either on thunderx or
on cortex-a57); I have merely heard “no speed-up”.

So I am strongly in favor of defaulting to the ‘safe’ number of iterations,
even when compiling for a generic target.

Best,
Philipp.
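
To illustrate why the number of refinement steps determines the precision of the result, here is a minimal numerical sketch of Newton-Raphson refinement for 1/sqrt(a). It is not the patch under discussion and not the hardware behaviour: `coarse_rsqrt` is a hypothetical stand-in for an estimate instruction, and the 8-bit starting precision used below is an assumption chosen only for illustration.

```python
import math


def coarse_rsqrt(a, bits):
    """Hypothetical stand-in for a hardware estimate (an assumption, not the
    real instruction): 1/sqrt(a) truncated to about `bits` significant bits."""
    exact = 1.0 / math.sqrt(a)
    m, e = math.frexp(exact)            # exact == m * 2**e with 0.5 <= m < 1
    scale = 1 << bits
    return math.ldexp(math.floor(m * scale) / scale, e)


def refine(a, x, steps):
    """Apply `steps` Newton-Raphson iterations x <- x * (1.5 - 0.5 * a * x * x)."""
    for _ in range(steps):
        x = x * (1.5 - 0.5 * a * x * x)
    return x


def correct_bits(a, x):
    """Approximate number of correct bits in x, capped at the 53-bit
    significand of a Python float (IEEE double)."""
    exact = 1.0 / math.sqrt(a)
    rel = abs(x - exact) / exact
    return 53.0 if rel == 0.0 else min(53.0, -math.log2(rel))
```

Calling `correct_bits(2.0, refine(2.0, coarse_rsqrt(2.0, 8), n))` for n = 0, 1, 2, 3 should show the count of correct bits roughly doubling with each step until it saturates near the precision of the format. A weaker starting estimate therefore needs an extra step to reach the same result, which is the concern with choosing the step count per micro-architecture.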