------- Comment #14 from rguenth at gcc dot gnu dot org 2007-06-10 12:07 -------
The interesting difference between sqrtss, divss and rcpss, rsqrtss is that the former have a throughput of 1/16 while the latter have 1/1 (latencies compare 21 vs. 3). This is on K10. The optimization guide only mentions calculating the quotient y = a/b via rcpss and the square root (!) via rsqrtss:

  sqrt(a) = 0.5 * a * rsqrtss(a) * (3.0 - a * rsqrtss(a) * rsqrtss(a))

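For illustration, the rsqrtss formula can be written with SSE intrinsics roughly as follows. This is only a sketch, assuming an x86 target with SSE enabled; the function names sqrt_via_rsqrtss and div_via_rcpss are made up for this example, and the division variant uses the standard Newton-Raphson step for a reciprocal, which is not spelled out in the comment above.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ss, _mm_rsqrt_ss, _mm_rcp_ss */

/* Hypothetical helper: approximate sqrt(a) for finite a > 0.
   rsqrtss gives a low-precision estimate x of 1/sqrt(a); one
   Newton-Raphson step, folded into the multiplication by a,
   yields 0.5 * a * x * (3 - a * x * x). */
static float sqrt_via_rsqrtss(float a)
{
    float x = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(a)));
    return 0.5f * a * x * (3.0f - a * x * x);
}

/* Hypothetical helper: approximate a / b for finite b != 0.
   rcpss gives a low-precision estimate x of 1/b; the
   Newton-Raphson step x * (2 - b * x) refines it before the
   final multiplication by a. */
static float div_via_rcpss(float a, float b)
{
    float x = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(b)));
    x = x * (2.0f - b * x);
    return a * x;
}
```

Note that neither helper handles special operands: sqrt_via_rsqrtss(0.0f), for example, evaluates 0 * inf and returns NaN, so a real transformation would need to guard or exclude such inputs.
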
So the optimization would be mainly to improve instruction throughput, not overall latency.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723