On 05/25/2017 03:45 AM, Thomas Koenig wrote:
> Hello world,
>
> the attached patch speeds up the library version of matmul for AMD chips
> by selecting AVX128 instructions and, depending on which instructions
> are supported, either FMA3 (aka FMA) or FMA4.
>
> Jerry tested this on his AMD systems, and found a speedup vs. the
> current code of around 10%.
>
> I have been unable to test this on a Ryzen system (the new compile farm
> machines won't accept my login yet).  From the benchmarks I have read,
> this method should also work fairly well on a Ryzen.
>
> So, OK for trunk?

Yes, OK.  Maybe test Ryzen first?

I just confirmed access to the Ryzen machines so I plan to get set up and test there.

Time to start looking under the hood.

cat /proc/cpuinfo gives the following flags:

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx hw_pstate vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic overflow_recov succor smca
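
For reference, the flags above list fma, avx and avx2 but not fma4, so on this machine only the FMA3 variant could apply. Below is a minimal sketch in C of that kind of check, done on a /proc/cpuinfo-style flags string. It is not the code from the patch: has_flag and pick_matmul_variant are hypothetical names, the returned strings are made-up labels, and preferring FMA3 over FMA4 when both are present is my assumption. The library itself presumably detects features via cpuid rather than by parsing text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return true if `flag` appears as a whole
   whitespace-separated word in `flags`, so that "fma" does not
   match inside "fma4" and "avx" does not match inside "avx2".  */
static bool has_flag(const char *flags, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = flags;

    if (n == 0)
        return false;

    while ((p = strstr(p, flag)) != NULL) {
        bool starts_word = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        bool ends_word = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (starts_word && ends_word)
            return true;
        p += n;  /* skip this partial match and keep searching */
    }
    return false;
}

/* Hypothetical selector: returns a made-up label for the variant
   that would be chosen.  Assumption: FMA3 is preferred over FMA4
   when both are available; without AVX we fall back to generic.  */
static const char *pick_matmul_variant(const char *flags)
{
    if (!has_flag(flags, "avx"))
        return "generic";
    if (has_flag(flags, "fma"))
        return "avx128_fma3";
    if (has_flag(flags, "fma4"))
        return "avx128_fma4";
    return "generic";
}
```

Fed the flags line above, pick_matmul_variant would return "avx128_fma3", since avx and fma are both present as whole words and fma4 is absent.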
