https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #23 from rguenther at suse dot de <rguenther at suse dot de> --- On Tue, 22 Jan 2019, elrodc at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 > > --- Comment #22 from Chris Elrod <elrodc at gmail dot com> --- > Okay. I did that, and the time went from about 4.25 microseconds down to 4.0 > microseconds. So that is an improvement, but accounts for only a small part of > the difference with the LLVM-compilers. > > -O3 -fno-math-errno > > was about 3.5 microseconds, so -funsafe-math-optimizations still results in a > regression in this code. > > 3.5 microseconds is roughly as fast as you can get with vsqrt and div. > > My best guess now is that gcc does a lot more to improve the accuracy of > vsqrt. > If I understand correctly, these are all the involved instructions: > > vmovaps .LC2(%rip), %zmm7 > vmovaps .LC3(%rip), %zmm6 > # for loop begins > vrsqrt14ps %zmm1, %zmm2 # comparison and mask removed > vmulps %zmm1, %zmm2, %zmm0 > vmulps %zmm2, %zmm0, %zmm1 > vmulps %zmm6, %zmm0, %zmm0 > vaddps %zmm7, %zmm1, %zmm1 > vmulps %zmm0, %zmm1, %zmm1 > vrcp14ps %zmm1, %zmm0 > vmulps %zmm1, %zmm0, %zmm1 > vmulps %zmm1, %zmm0, %zmm1 > vaddps %zmm0, %zmm0, %zmm0 > vsubps %zmm1, %zmm0, %zmm0 > vfnmadd213ps (%r10,%rax), %zmm0, %zmm2 > > If I understand this correctly: > > zmm2 =(approx) 1 / sqrt(zmm1) > zmm0 = zmm1 * zmm2 = (approx) sqrt(zmm1) > zmm1 = zmm0 * zmm2 = (approx) 1 > zmm0 = zmm6 * zmm0 = (approx) constant6 * sqrt(zmm1) > zmm1 = zmm7 * zmm1 = (approx) constant7 > zmm1 = zmm0 * zmm1 = (approx) constant6 * constant6 * sqrt(zmm1) > zmm0 = (approx) 1 / zmm1 = (approx) 1 / sqrt(zmm1) * 1 / (constant6 * > constant7) > zmm1 = zmm1 * zmm0 = (approx) 1 > zmm1 = zmm1 * zmm0 = (approx) 1 / sqrt(zmm1) * 1 / (constant6 * constant7) > zmm0 = 2 * zmm0 = (approx) 2 / sqrt(zmm1) * 2 / (constant6 * constant7) > zmm0 = zmm1 - zmm0 = (approx) -1 / sqrt(zmm1) * 1 / (constant6 * constant7) > > which implies that constant6 * constant6 = approximately -1? GCC implements /* sqrt(a) = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */ which looks similar to what LLVM does. You can look at the -fdump-tree-optimized dump to see if there's anything fishy. > > LLVM seems to do a much simpler / briefer update of the output of vrsqrt. > > When I implemented a vrsqrt intrinsic in a Julia library, I just looked at > Wikipedia and did (roughly): > > constant1 = -0.5 > constant2 = 1.5 > > zmm2 = (approx) 1 / sqrt(zmm1) > zmm3 = constant * zmm1 > zmm1 = zmm2 * zmm2 > zmm3 = zmm3 * zmm1 + constant2 > zmm2 = zmm2 * zmm3 > > > I am not a numerical analyst, so I can't comment on relative validities or > accuracies of these approaches. > I also don't know what LLVM 7+ does. LLVM 6 doesn't use vrsqrt. > > I would be interesting in reading explanations or discussions, if any are > available. > >