https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82344

--- Comment #5 from Yulia Koval <julia.koval at intel dot com> ---
(In reply to Richard Biener from comment #3)
> So the Newton-Raphson step causes register pressure to increase and
> post-Haswell this makes code slower than not using rsqrt (thus using sqrtf
> and a division)?
> 
> I wonder whether it would be profitable to SLP vectorize this (of course
> we're not considering this because SLP vectorization is looking for stores).
> SLP vectorization would need to do 4 (or 8 with 256-bit AVX) vector inserts
> and extracts but then could do the rsqrt and Newton-Raphson together.
> The argument computation to the sqrt is also loop vectorizable and the
> ultimate operands even come from contiguous memory.  One of the tricky parts
> would be to see that only the first rsqrt arg is re-used and thus taking
> rinv21 to rinv33 (8 rsqrts) for the vectorization is probably best.
> 
>           rinv11           = 1.0/sqrt(rsq11)
>           rinv21           = 1.0/sqrt(rsq21)
>           rinv31           = 1.0/sqrt(rsq31)
>           rinv12           = 1.0/sqrt(rsq12)
>           rinv22           = 1.0/sqrt(rsq22)
>           rinv32           = 1.0/sqrt(rsq32)
>           rinv13           = 1.0/sqrt(rsq13)
>           rinv23           = 1.0/sqrt(rsq23)
>           rinv33           = 1.0/sqrt(rsq33)
>           r11              = rsq11*rinv11
> 
> What does ICC do to this loop?
> 
> I can confirm the regression on our tester (a Haswell machine btw).

ICC generates vrsqrtps in this case.
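
For reference, a minimal sketch of the refinement step discussed in comment #3,
written as plain scalar C rather than taken from GCC's or ICC's actual output.
The names rsqrt_nr_step and rsqrt_refine_batch are hypothetical, and the est[]
array is an assumption that stands in for the roughly 12-bit estimate that
rsqrtss/vrsqrtps would deliver.

```c
#include <stddef.h>

/* One Newton-Raphson step for y ~= 1/sqrt(x):
 *     y1 = y0 * (1.5 - 0.5 * x * y0 * y0)
 * Convergence is quadratic, so one step roughly doubles the number of
 * correct bits in the estimate y0. */
static inline float rsqrt_nr_step(float x, float y0)
{
    return y0 * (1.5f - 0.5f * x * y0 * y0);
}

/* Refine n independent reciprocal square roots.
 * rsq[]  : inputs (rsq11 ... rsq33 in the loop quoted above)
 * est[]  : initial estimates; assumed to come from a hardware rsqrt
 *          approximation, supplied by the caller in this sketch
 * rinv[] : refined results
 * The iterations do not depend on each other, which is what makes a
 * packed estimate plus a packed refinement step possible. */
static void rsqrt_refine_batch(const float *rsq, const float *est,
                               float *rinv, size_t n)
{
    for (size_t i = 0; i < n; i++)
        rinv[i] = rsqrt_nr_step(rsq[i], est[i]);
}
```

Each step costs several multiplies and a subtract and keeps x, y0 and the two
constants live at once, which is where the extra register pressure mentioned
in comment #3 would come from when nine of these are done as scalars.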
