Evandro Menezes <e.mene...@samsung.com> wrote:
> On 03/18/16 10:21, Wilco Dijkstra wrote:
> > Hi Evandro,
> >
> >> For example, though this approximation improves the performance
> >> noticeably for DF on A57, for SF, not so much, if at all.
> > I'm still skeptical that you can ever get any gain on scalars. I bet
> > the only gain is on 4x vectorized floats.
>
> I created a simple test that loops around an inline asm version of the
> Newton series using scalar insns and got these results on A57:

That measures pure maximum throughput rather than answering the question of
whether it speeds up code that does real work. A test that loads an array of
vectors and writes back the unit vectors would be a more realistic scenario.

Note our testing showed rsqrt slows down various benchmarks:
https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00574.html.

> If I understood you correctly, would something like coarse tuning flags
> along with target-specific cost or parameters tables be what you have in
> mind?

Yes, the magic tuning flags can be coarse (on/off is good enough). If we can
agree that these expansions are really only useful for 4x vectorized code and
not much else then all we need is a function that enables it for those modes. 
Otherwise we would need per-CPU settings that select which expansions are
enabled for which modes (not just single/double).
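Something like the following sketch is all that the simple variant would
need (the enum and function names are made up for illustration, not the
actual GCC internals):

```c
/* Hypothetical sketch: gate the rsqrt expansion on a coarse per-CPU
   tuning flag plus a per-mode predicate, enabling it only for 4x
   vectorized single precision.  */
enum mode_sketch { SF_MODE, DF_MODE, V2SF_MODE, V4SF_MODE, V2DF_MODE };

static int use_rsqrt_for_mode(enum mode_sketch mode, int cpu_rsqrt_flag)
{
    /* Coarse on/off flag from the CPU tuning table ... */
    if (!cpu_rsqrt_flag)
        return 0;
    /* ... plus the agreed per-mode rule: only 4x float benefits.  */
    return mode == V4SF_MODE;
}
```

The per-CPU variant would replace the single flag with a per-mode bitmask
in each CPU's tuning table.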

Wilco
