Evandro Menezes <e.mene...@samsung.com> wrote:
>
> That's what I had in mind too, but around the approximation for x^-1/2
> and using masks for vector cases thusly:
>
>     fcmne   v3.4s, v0.4s, #0.0
>     frsqrte v1.4s, v0.4s
>     fmul    v2.4s, v1.4s, v1.4s
>     frsqrts v2.4s, v0.4s, v2.4s
>     fmul    v1.4s, v1.4s, v2.4s
>     fmul    v2.4s, v1.4s, v1.4s
>     frsqrts v2.4s, v0.4s, v2.4s
>     fmul    v1.4s, v1.4s, v2.4s
>     and     v1.4s, v3.4s
>     fmul    v0.4s, v1.4s, v0.4s

That's possible, but the overall latency is higher: according to
exynos-1.md the above takes 44 cycles, while my version would be 37.

Wilco
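
For reference, a rough scalar C model of the quoted sequence, one lane at a time. It is only a sketch under stated assumptions: the function names are hypothetical, the initial estimate uses the well-known bit-trick constant as a stand-in for the hardware FRSQRTE table, and the FRSQRTS model omits the instruction's special handling of 0 * infinity. It says nothing about latency; it only shows why the mask is needed before the final multiply.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for FRSQRTE: a rough estimate of 1/sqrt(x), for x >= 0.
   Assumption: the bit-trick constant replaces the hardware lookup table.
   Zero is mapped to +infinity, which is what the instruction returns. */
static float rsqrt_estimate(float x)
{
    uint32_t bits;
    float e;

    if (x == 0.0f)
        return INFINITY;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&e, &bits, sizeof e);
    return e;
}

/* Simplified model of FRSQRTS: (3 - a * b) / 2.
   Unlike the real instruction, 0 * infinity yields NaN here. */
static float rsqrt_step(float a, float b)
{
    return (3.0f - a * b) / 2.0f;
}

/* sqrt(x) computed as x * (1/sqrt(x)), following the quoted sequence. */
float sqrt_via_rsqrt(float x)
{
    uint32_t mask = (x != 0.0f) ? 0xffffffffu : 0u;  /* fcmne   */
    uint32_t bits;
    float e = rsqrt_estimate(x);                     /* frsqrte */
    int i;

    for (i = 0; i < 2; i++)                          /* two Newton-Raphson steps */
        e = e * rsqrt_step(x, e * e);                /* fmul, frsqrts, fmul */

    memcpy(&bits, &e, sizeof bits);
    bits &= mask;                                    /* and: clear the lane where x == 0 */
    memcpy(&e, &bits, sizeof e);

    return e * x;                                    /* fmul: x * (1/sqrt(x)) */
}
```

Without the mask, an input of zero would leave an infinite (or, in this model, NaN) reciprocal estimate to be multiplied by 0, producing NaN rather than 0. Calling sqrt_via_rsqrt(4.0f) returns a value close to 2, and sqrt_via_rsqrt(0.0f) returns 0.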