On 05/25/2017 03:45 AM, Thomas Koenig wrote:
> Hello world,
>
> the attached patch speeds up the library version of matmul for AMD chips
> by selecting AVX128 instructions and, depending on which instructions
> are supported, either FMA3 (aka FMA) or FMA4.
>
> Jerry tested this on his AMD systems, and found a speedup vs. the
> current code of around 10%.
>
> I have been unable to test this on a Ryzen system (the new compile farm
> machines won't accept my login yet). From the benchmarks I have read,
> this method should also work fairly well on a Ryzen.
>
> So, OK for trunk?

Yes, OK. Maybe test Ryzen first?
I just confirmed access to the Ryzen machines so I plan to get set up and test
there.
Time to start looking under the hood.
cat /proc/cpuinfo reports the following flags:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm
constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni
pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c
rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse
3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext
perfctr_l2 mwaitx hw_pstate vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap
clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf arat npt lbrv
svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter
pfthreshold avic overflow_recov succor smca
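
To make the flag check concrete, here is a minimal sketch of how the choice described in the patch (AVX128 with either FMA3 or FMA4) could be read off the flags line. It is only an illustration in Python, not the libgfortran code: the function names parse_flags and pick_variant, the returned labels, and the order in which FMA3 is preferred over FMA4 are all assumptions made for this example.

```python
def parse_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def pick_variant(flags):
    """Hypothetical selection: prefer FMA3, then FMA4, else a generic fallback."""
    if "avx" in flags and "fma" in flags:
        return "avx128_fma3"
    if "avx" in flags and "fma4" in flags:
        return "avx128_fma4"
    return "generic"


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            flags = parse_flags(f.read())
    except OSError:
        # Not on Linux, or /proc is unavailable: fall back to no known flags.
        flags = set()
    print(pick_variant(flags))
```

The flags listed above include avx, avx2 and fma but not fma4, so under these assumptions the sketch would pick the FMA3 path on this machine.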