Evandro Menezes <e.mene...@samsung.com> wrote:
> On 03/18/16 10:21, Wilco Dijkstra wrote:
> > Hi Evandro,
> >
> >> For example, though this approximation improves the performance
> >> noticeably for DF on A57, for SF, not so much, if at all.
> >
> > I'm still skeptical that you can ever get any gain on scalars. I bet
> > the only gain is on 4x vectorized floats.
>
> I created a simple test that loops around an inline asm version of the
> Newton series using scalar insns and got these results on A57:
That's pure max throughput rather than answering the question of whether it
speeds up code that does real work. A test that loads an array of vectors
and writes back the unit vectors would be a more realistic scenario. Note
that our testing showed rsqrt slows down various benchmarks:
https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00574.html

> If I understood you correctly, would something like coarse tuning flags
> along with target-specific cost or parameter tables be what you have in
> mind?

Yes, the magic tuning flags can be coarse (on/off is good enough). If we
can agree that these expansions are really only useful for 4x vectorized
code and not much else, then all we need is a function that enables them
for those modes. Otherwise we would need per-CPU settings that select
which expansions are enabled for which modes (not just single/double).

Wilco