https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82344
--- Comment #5 from Yulia Koval <julia.koval at intel dot com> ---
(In reply to Richard Biener from comment #3)
> So the newton-raphson step causes register pressure to increase and post
> haswell this makes code slower than not using rsqrt (thus using sqrtf and a
> division)?
>
> I wonder whether it would be profitable to SLP vectorize this (of course
> we're not considering this because SLP vectorization is looking for stores).
> SLP vectorization would need to do 4 (or 8 with avx256) vector inserts
> and extracts but then could do the rsqrt and newton raphson together.
> The argument computation to the sqrt also loop vectorizable and the ultimate
> operands even come from continuous memory. One of the tricky parts would be
> to see that the only first rsqrt arg is re-used and thus taking
> rinv21 to rinv33 (8 rsqrts) for the vectorization is probably best.
>
> rinv11 = 1.0/sqrt(rsq11)
> rinv21 = 1.0/sqrt(rsq21)
> rinv31 = 1.0/sqrt(rsq31)
> rinv12 = 1.0/sqrt(rsq12)
> rinv22 = 1.0/sqrt(rsq22)
> rinv32 = 1.0/sqrt(rsq32)
> rinv13 = 1.0/sqrt(rsq13)
> rinv23 = 1.0/sqrt(rsq23)
> rinv33 = 1.0/sqrt(rsq33)
> r11 = rsq11*rinv11
>
> What does ICC do to this loop?
>
> I can confirm the regression on our tester (a Haswell machine btw).

ICC generates vrsqrtps in this case.
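
For reference, the sequence under discussion (a packed rsqrt estimate refined by one Newton-Raphson step) can be sketched with SSE intrinsics roughly as below. This is only an illustration, not the code GCC or ICC actually emits for the loop. The names rsqrt_nr4 and rsqrt_nr are made up here, and the padded tail handling and the scalar fallback for non-SSE targets are assumptions added to keep the example self-contained.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#else
#include <math.h>
#endif

#if defined(__SSE__)
/* Hypothetical helper: four reciprocal square roots at once, using the
   rsqrtps estimate followed by one Newton-Raphson refinement step. */
static __m128 rsqrt_nr4(__m128 x)
{
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);
    __m128 y = _mm_rsqrt_ps(x);                    /* roughly 12-bit estimate */
    __m128 xyy = _mm_mul_ps(x, _mm_mul_ps(y, y));  /* x * y^2 */
    /* y' = y * (1.5 - 0.5 * x * y^2) */
    return _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, xyy)));
}
#endif

/* Hypothetical driver: out[i] = 1/sqrt(in[i]) for n positive inputs. */
void rsqrt_nr(const float *in, float *out, size_t n)
{
#if defined(__SSE__)
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, rsqrt_nr4(_mm_loadu_ps(in + i)));
    if (i < n) {
        /* Assumption: pad the leftover lanes with 1.0f so the tail can
           reuse the same vector path. */
        float tmp[4] = {1.0f, 1.0f, 1.0f, 1.0f};
        memcpy(tmp, in + i, (n - i) * sizeof(float));
        _mm_storeu_ps(tmp, rsqrt_nr4(_mm_loadu_ps(tmp)));
        memcpy(out + i, tmp, (n - i) * sizeof(float));
    }
#else
    /* Fallback without SSE: plain sqrtf and a division. */
    for (size_t i = 0; i < n; i++)
        out[i] = 1.0f / sqrtf(in[i]);
#endif
}
```

Called on nine values (one per rinvXY above), this handles eight of them in two packed operations and the ninth through the padded tail. The refinement step costs several extra multiplies and two constants per vector, which is where the additional register pressure mentioned in comment #3 would come from.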