Hi Evandro,

> For example, though this approximation is improves the performance
> noticeably for DF on A57, for SF, not so much, if at all.

I'm still skeptical that you can ever get any gain on scalars. I bet the
only gain is on 4x vectorized floats.

So what I would like to see is this implemented in a more general way.
We should be able to choose whether to expand depending on the mode,
including whether it is vectorized. For example, enable it on V4SFmode
and maybe V2DFmode, but not on any scalars. Then we'd add new CPU tuning
settings for division, sqrt and rsqrt (rather than adding lots of extra
tune flags).

Note the md file should call a function in aarch64.c to decide whether
to expand or not (your division approximation patch makes the decision
in the md file, which does not seem like a good idea).

Wilco
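
P.S. To make the suggestion above more concrete, here is a rough,
self-contained sketch of the kind of interface I have in mind. Every
name in it (approx_mode, approx_tune, use_approx_p, example_tune) is
hypothetical and is not existing GCC code; the mode enum is only a
stand-in for the real machine modes, and the example tuning values are
illustrative assumptions rather than measured settings.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the machine modes discussed above.  */
enum approx_mode { MODE_SF, MODE_DF, MODE_V2SF, MODE_V4SF, MODE_V2DF };

/* One bit per mode, so a tuning entry can list any set of modes.  */
#define APPROX_BIT(m) (1u << (m))
#define APPROX_NONE   0u

/* The three operations that would each get a tuning setting.  */
enum approx_op { OP_DIV, OP_SQRT, OP_RSQRT };

/* Hypothetical per-CPU tuning: for each operation, the set of modes
   for which the approximation should be expanded.  */
struct approx_tune
{
  unsigned int division;
  unsigned int sqrt;
  unsigned int rsqrt;
};

/* Illustrative values only: vector modes enabled, no scalars.  */
static const struct approx_tune example_tune =
{
  .division = APPROX_BIT (MODE_V4SF),
  .sqrt     = APPROX_BIT (MODE_V4SF),
  .rsqrt    = APPROX_BIT (MODE_V4SF) | APPROX_BIT (MODE_V2DF),
};

/* The single decision point the md expanders would call: return true
   if OP should be expanded to the approximation in MODE.  */
bool
use_approx_p (const struct approx_tune *tune, enum approx_op op,
              enum approx_mode mode)
{
  unsigned int mask = APPROX_NONE;

  switch (op)
    {
    case OP_DIV:   mask = tune->division; break;
    case OP_SQRT:  mask = tune->sqrt;     break;
    case OP_RSQRT: mask = tune->rsqrt;    break;
    }

  return (mask & APPROX_BIT (mode)) != 0;
}
```

With something of this shape, each md expander would only ask the
helper for its operation and mode, and a new CPU would change the
tuning table rather than the patterns.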