Thanks - I will try to figure out how to do that. I will note, however, that the OpenBLAS FAQ suggests that OpenBLAS tries to avoid allocating threads to the same physical core on machines with hyperthreading, so perhaps this is not the cause:
https://github.com/xianyi/OpenBLAS/blob/master/GotoBLAS_03FAQ.txt

On Thursday, October 20, 2016 at 4:45:51 PM UTC-5, Stefan Karpinski wrote:
>
> I think Ralph is suggesting that you disable the CPU's hyperthreading if
> you run this kind of code often. We've done that on our benchmarking
> machines, for example.
>
> On Wed, Oct 19, 2016 at 11:47 PM, Thomas Covert <thom....@gmail.com> wrote:
>
>> So are you suggesting that real numerical workloads under
>> BLAS.set_num_threads(4) will indeed be faster than under
>> BLAS.set_num_threads(2)? That hasn't been my experience, and I figured
>> the peakflops() example would constitute an MWE. Is there another
>> workload you would suggest I try, to figure out whether this is just a
>> peakflops() aberration or a real issue?
>>
>> On Wednesday, October 19, 2016 at 8:28:16 PM UTC-5, Ralph Smith wrote:
>>>
>>> At least 2 things contribute to erratic results from peakflops(). With
>>> hyperthreading, the threads are not always put on separate cores.
>>> Secondly, the measured time includes the allocation of the result
>>> matrix, so garbage collection affects some of the results. Most
>>> available advice says to disable hyperthreading on dedicated number
>>> crunchers (most full loads work slightly more efficiently without the
>>> extra context switching). The GC issue seems to be a mistake, if "peak"
>>> is to be taken seriously.
>>>
>>> On Wednesday, October 19, 2016 at 12:04:00 PM UTC-4, Thomas Covert wrote:
>>>>
>>>> I have a recent iMac with 4 physical cores (and 8 hyperthreads). I
>>>> would have thought that peakflops(N) for a large enough N should be
>>>> increasing in the number of threads I allow BLAS to use. I do find
>>>> that peakflops(N) with 1 thread is about half as high as peakflops(N)
>>>> with 2 threads, but there is no gain from 4 threads. Are my
>>>> expectations wrong here, or is it possible that BLAS is somehow
>>>> configured incorrectly on my machine?
>>>> In the example below, N = 6755, a number relevant for my work, but
>>>> the results are similar with 5000 or 10000.
>>>>
>>>> Here is my versioninfo():
>>>>
>>>> julia> versioninfo()
>>>> Julia Version 0.5.0
>>>> Commit 3c9d753* (2016-09-19 18:14 UTC)
>>>> Platform Info:
>>>>   System: Darwin (x86_64-apple-darwin15.6.0)
>>>>   CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
>>>>   WORD_SIZE: 64
>>>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>>>>   LAPACK: libopenblas
>>>>   LIBM: libopenlibm
>>>>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>>>>
>>>> Here is an example peakflops() exercise:
>>>>
>>>> julia> BLAS.set_num_threads(1)
>>>>
>>>> julia> mean(peakflops(6755) for i=1:10)
>>>> 5.225580459387056e10
>>>>
>>>> julia> BLAS.set_num_threads(2)
>>>>
>>>> julia> mean(peakflops(6755) for i=1:10)
>>>> 1.004317640281997e11
>>>>
>>>> julia> BLAS.set_num_threads(4)
>>>>
>>>> julia> mean(peakflops(6755) for i=1:10)
>>>> 9.838116463900085e10
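Incidentally, Ralph's point about allocation and GC can be sidestepped by preallocating the output matrix and timing only the in-place multiply. A minimal sketch, assuming a modern Julia with the LinearAlgebra stdlib (on 0.5, `mul!` would be `A_mul_B!` and `BLAS` lives in Base); `matmul_flops` is just a made-up helper name, not anything from Base:

```julia
using LinearAlgebra

# Estimate matmul flops without allocation in the timed region:
# preallocate C, warm up once, then time mul! alone and take the best run.
function matmul_flops(n; ntrials = 3)
    A = rand(n, n); B = rand(n, n); C = zeros(n, n)
    mul!(C, A, B)                  # warm-up: compilation, page faults
    best = Inf
    for _ in 1:ntrials
        t = @elapsed mul!(C, A, B)
        best = min(best, t)
    end
    2 * n^3 / best                 # an n-by-n matmul is ~2n^3 flops
end

BLAS.set_num_threads(2)
matmul_flops(2000)
```

Taking the minimum over trials (rather than the mean) also makes the number less sensitive to OS scheduling noise, which may matter as much as hyperthreading here.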