For what it's worth, I get 37000 Mflops from the dgemm.goto benchmark with the current Guix openblas and OPENBLAS_NUM_THREADS=1, at a size of 7000, on a laptop with an "i5-6200U CPU @ 2.30GHz" (avx2). That looks about right, and the rate should more or less plateau by that size. For comparison, I get 44000 on a cluster node ("E5-2690 v3 @ 2.60GHz") with its serial build of 0.2.19. (I misremembered the sandybridge figures, which should be in the low 20s, not the high 20s.)
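As a rough sanity check on "about right", here is a minimal Python sketch comparing those two figures with a nominal single-core peak. It assumes 16 double-precision flops per cycle for an avx2 core with two FMA units (my assumption, not something measured here) and takes the nominal clock from the CPU model string; peak_mflops is just a name made up for illustration.

```python
# Assumption: an avx2 core with two 256-bit FMA units retires
# 2 units * 4 doubles * 2 flops (multiply + add) = 16 DP flops per cycle.
AVX2_FMA_FLOPS_PER_CYCLE = 16


def peak_mflops(clock_mhz, flops_per_cycle):
    """Nominal single-core double-precision peak, in Mflops."""
    return clock_mhz * flops_per_cycle


# (model string, nominal clock in MHz, measured dgemm rate in Mflops)
results = [
    ("i5-6200U @ 2.30GHz", 2300, 37000),
    ("E5-2690 v3 @ 2.60GHz", 2600, 44000),
]

for model, clock_mhz, measured in results:
    peak = peak_mflops(clock_mhz, AVX2_FMA_FLOPS_PER_CYCLE)
    ratio = measured / peak
    print(f"{model}: nominal peak {peak} Mflops, measured/peak = {ratio:.2f}")
```

A ratio a little above 1 against the nominal clock would suggest the core was running at a turbo frequency during the run, so treat the nominal peak as a ballpark rather than a ceiling.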
If you see something much different, perhaps the performance counters will give a clue, e.g. using Guix's scorep/cube, oprofile, or perf. I've sent a patch to correct the cache size on haswell, but I don't think it makes much difference in this case.