On Thu, May 25, 2017 at 1:45 PM, Thomas Koenig <tkoe...@netcologne.de> wrote:
> Hello world,
>
> the attached patch speeds up the library version of matmul for AMD chips
> by selecting AVX128 instructions and, depending on which instructions
> are supported, either FMA3 (aka FMA) or FMA4.
>
> Jerry tested this on his AMD systems, and found a speedup vs. the
> current code of around 10%.
>
> I have been unable to test this on a Ryzen system (the new compile farm
> machines won't accept my login yet).  From the benchmarks I have read,
> this method should also work fairly well on a Ryzen.
>
> So, OK for trunk?

In some comments, you have -mprefer=avx128 whereas the option that gcc
understands is -mprefer-avx128. Also, have you verified that e.g.
contemporary Intel processors still use the avx256 codepath and don't
accidentally end up with avx128?

As for FMA4, are there enough processors around that support FMA4 but
not FMA3 to justify bloating the library to support them? As I
understand it, this applies to only a single AMD CPU generation
("Bulldozer" in 2011); the next one ("Piledriver" in 2012) added FMA3
in addition to FMA4. In the new Zen core (Ryzen, Epyc, etc.) AMD has
dropped support for FMA4. There are reports that Zen will still
execute FMA4 instructions for backward compatibility even though the
feature is no longer advertised in CPUID, but in any case AMD seems to
consider it a legacy instruction set that should not be used anymore
(Intel never supported it).
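
To make the two concerns above concrete, here is a rough sketch of the
kind of runtime selection I have in mind. It is not taken from the
patch: the enum, choose_kernel() and detect_kernel() are hypothetical
names, and the fallback for an AMD chip with AVX but no FMA at all is
my own assumption. The only real interfaces used are the GCC builtins
__builtin_cpu_init, __builtin_cpu_is and __builtin_cpu_supports.

```c
/* Hypothetical kernel identifiers; not the names used in libgfortran. */
enum matmul_kernel {
    KERNEL_GENERIC,      /* no AVX available */
    KERNEL_AVX256,       /* 256-bit AVX, no FMA */
    KERNEL_AVX256_FMA3,  /* 256-bit AVX2 + FMA3 */
    KERNEL_AVX128_FMA3,  /* 128-bit AVX + FMA3 (AMD) */
    KERNEL_AVX128_FMA4   /* 128-bit AVX + FMA4 (AMD, FMA4-only chips) */
};

/* Pure selection logic, kept apart from CPU detection so that it can be
   checked against hand-made feature sets for CPUs one does not own. */
enum matmul_kernel
choose_kernel(int is_amd, int has_avx, int has_avx2,
              int has_fma3, int has_fma4)
{
    if (!has_avx)
        return KERNEL_GENERIC;

    if (is_amd) {
        /* Prefer FMA3 whenever it is advertised; FMA4 is only used on
           chips that report FMA4 without FMA3. */
        if (has_fma3)
            return KERNEL_AVX128_FMA3;
        if (has_fma4)
            return KERNEL_AVX128_FMA4;
        /* Assumption: AMD with AVX but no FMA uses the plain AVX path. */
        return KERNEL_AVX256;
    }

    /* Non-AMD vendors never take the 128-bit path in this sketch. */
    if (has_avx2 && has_fma3)
        return KERNEL_AVX256_FMA3;
    return KERNEL_AVX256;
}

/* Query the running CPU through the GCC builtins. On non-x86 targets or
   other compilers this sketch simply reports the generic kernel. */
enum matmul_kernel
detect_kernel(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return choose_kernel(__builtin_cpu_is("amd") != 0,
                         __builtin_cpu_supports("avx") != 0,
                         __builtin_cpu_supports("avx2") != 0,
                         __builtin_cpu_supports("fma") != 0,
                         __builtin_cpu_supports("fma4") != 0);
#else
    return KERNEL_GENERIC;
#endif
}
```

With the decision split out like this, the Intel case can be checked
without access to the hardware: calling choose_kernel() with is_amd
set to 0 must never return one of the AVX128 values. Because the FMA4
branch is only reached when FMA3 is absent, it is easy to see how few
CPUs would actually exercise it.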


-- 
Janne Blomqvist
