For what it's worth, I get 37000 Mflops from the dgemm.goto benchmark
using the current Guix openblas and OPENBLAS_NUM_THREADS=1 at a size of
7000 on a laptop with "i5-6200U CPU @ 2.30GHz" (avx2).  That looks about
right, and it should more-or-less plateau at that size.  For comparison,
I get 44000 on a cluster node "E5-2690 v3 @ 2.60GHz" with its serial
build of 0.2.19.  (I mis-remembered the sandybridge figures, which
should be low 20s, not high 20s.)
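
As a rough sanity check on "about right", here is a small sketch of the arithmetic.  It assumes the benchmark counts 2*n^3 floating-point operations for a square dgemm of size n, and that an avx2 core with FMA retires at most 16 double-precision flops per cycle (2 FMA units x 4 doubles x 2 flops).  The function names are mine, purely for illustration, and the peak is taken at the nominal clock, so a core running at turbo can legitimately exceed it.

```python
def dgemm_mflops(n, seconds):
    """Mflops for a square n x n dgemm, assuming 2*n^3 operations."""
    return 2.0 * n**3 / seconds / 1e6


def peak_mflops(clock_ghz, flops_per_cycle=16):
    """Single-core peak at a given clock.

    flops_per_cycle=16 is an assumption for avx2 with two FMA units
    (2 units x 4 doubles x 2 flops per fused multiply-add).
    """
    return clock_ghz * 1e3 * flops_per_cycle


if __name__ == "__main__":
    n = 7000
    # Run time implied by the reported 37000 Mflops at size 7000.
    seconds = 2.0 * n**3 / (37000 * 1e6)
    print("implied run time: %.1f s" % seconds)
    print("rate: %.0f Mflops" % dgemm_mflops(n, seconds))
    # Nominal-clock peaks for the two CPUs mentioned above.
    print("peak at 2.3 GHz: %.0f Mflops" % peak_mflops(2.3))
    print("peak at 2.6 GHz: %.0f Mflops" % peak_mflops(2.6))
```

Both measured figures come out close to (or slightly above) the nominal-clock peak, which is consistent with a well-tuned dgemm kernel running with some turbo boost.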

If you see something much different, the hardware performance counters
might give a clue, e.g. via Guix's scorep/cube, oprofile, or perf.

I've sent a patch for the correct cache size on haswell, but I don't
think it makes much difference in this case.

