At least in my experience on a Mac, I've never seen real linear algebra code (not just peakflops()) in Julia + OpenBLAS saturate more than 2 cores, even when setting the thread count to 4 on a machine with 4 physical cores. When I try similar code on a Linux machine I have access to, I never have a problem saturating as many physical cores as are available, which makes me think that somehow the BLAS + threading situation in the Mac build of Julia is not quite right.
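For concreteness, this is the kind of comparison I mean — a minimal sketch in Julia 0.5-era syntax (the matrix size and trial count here are arbitrary choices of mine; `A_mul_B!` multiplies in place, so allocation and GC are kept out of the timings):

```julia
# Minimal sketch (Julia 0.5-era syntax): time in-place GEMM at several
# BLAS thread counts.  Matrix size and trial count are arbitrary choices.
n = 4000
a = rand(n, n); b = rand(n, n); c = similar(a)
for t in (1, 2, 4)
    BLAS.set_num_threads(t)
    A_mul_B!(c, a, b)                                   # warm-up run
    secs = minimum([(@elapsed A_mul_B!(c, a, b)) for i in 1:5])
    # dense matrix multiply costs ~2n^3 flops
    println(t, " threads: ", 2n^3 / secs / 1e9, " GFLOPS")
end
```

On the Mac I see essentially no improvement going from 2 to 4 threads with a loop like this, while Activity Monitor shows only ~2 cores busy.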
On Friday, October 21, 2016 at 10:05:04 PM UTC-5, Ralph Smith wrote:
>
> On looking more carefully, I believe I was mistaken about thread
> assignment to cores - that seems to be done well in OpenBLAS (and maybe
> Linux in general nowadays). Perhaps the erratic benchmarks under
> hyperthreading - even after heap management is tamed - arise when the
> operating system detects idle virtual cores and schedules disruptive
> processes there.
>
> On Friday, October 21, 2016 at 12:09:07 AM UTC-4, Ralph Smith wrote:
>>
>> That's interesting, I see the code in OpenBLAS. However, on the Linux
>> systems I use, when I had hyperthreading enabled the allocations looked
>> random, and I generally got less consistent benchmarks. I'll have to
>> check that again.
>>
>> You can also avoid the memory allocation effects by something like
>>
>> using BenchmarkTools
>> a = rand(n,n); b = rand(n,n); c = similar(a);
>> @benchmark A_mul_B!($c,$a,$b)
>>
>> Of course this is only directly relevant to your real workload if that
>> is dominated by sections where you can optimize away allocations and
>> memory latency.
>>
>> On Thursday, October 20, 2016 at 11:00:41 PM UTC-4, Thomas Covert wrote:
>>>
>>> Thanks - I will try to figure out how to do that. I will note, however,
>>> that the OpenBLAS FAQ suggests that OpenBLAS tries to avoid allocating
>>> threads to the same physical core on machines with hyperthreading, so
>>> perhaps this is not the cause:
>>>
>>> https://github.com/xianyi/OpenBLAS/blob/master/GotoBLAS_03FAQ.txt
>>>
>>> On Thursday, October 20, 2016 at 4:45:51 PM UTC-5, Stefan Karpinski wrote:
>>>>
>>>> I think Ralph is suggesting that you disable the CPU's hyperthreading
>>>> if you run this kind of code often. We've done that on our
>>>> benchmarking machines, for example.
>>>>
>>>> On Wed, Oct 19, 2016 at 11:47 PM, Thomas Covert <thom....@gmail.com> wrote:
>>>>
>>>>> So are you suggesting that real numerical workloads under
>>>>> BLAS.set_num_threads(4) will indeed be faster than under
>>>>> BLAS.set_num_threads(2)? That hasn't been my experience, and I
>>>>> figured the peakflops() example would constitute an MWE. Is there
>>>>> another workload you would suggest I try, to figure out whether this
>>>>> is just a peakflops() aberration or a real issue?
>>>>>
>>>>> On Wednesday, October 19, 2016 at 8:28:16 PM UTC-5, Ralph Smith wrote:
>>>>>>
>>>>>> At least 2 things contribute to erratic results from peakflops().
>>>>>> With hyperthreading, the threads are not always put on separate
>>>>>> cores. Secondly, the measured time includes the allocation of the
>>>>>> result matrix, so garbage collection affects some of the results.
>>>>>> Most available advice says to disable hyperthreading on dedicated
>>>>>> number crunchers (most full loads work slightly more efficiently
>>>>>> without the extra context switching). The GC issue seems to be a
>>>>>> mistake, if "peak" is to be taken seriously.
>>>>>>
>>>>>> On Wednesday, October 19, 2016 at 12:04:00 PM UTC-4, Thomas Covert wrote:
>>>>>>>
>>>>>>> I have a recent iMac with 4 physical cores (and 8 hyperthreads). I
>>>>>>> would have thought that peakflops(N) for a large enough N should be
>>>>>>> increasing in the number of threads I allow BLAS to use. I do find
>>>>>>> that peakflops(N) with 1 thread is about half as high as
>>>>>>> peakflops(N) with 2 threads, but there is no gain from 4 threads.
>>>>>>> Are my expectations wrong here, or is it possible that BLAS is
>>>>>>> somehow configured incorrectly on my machine? In the example below,
>>>>>>> N = 6755, a number relevant for my work, but the results are
>>>>>>> similar with 5000 or 10000.
>>>>>>>
>>>>>>> Here is my versioninfo():
>>>>>>>
>>>>>>> julia> versioninfo()
>>>>>>> Julia Version 0.5.0
>>>>>>> Commit 3c9d753* (2016-09-19 18:14 UTC)
>>>>>>> Platform Info:
>>>>>>>   System: Darwin (x86_64-apple-darwin15.6.0)
>>>>>>>   CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
>>>>>>>   WORD_SIZE: 64
>>>>>>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>>>>>>>   LAPACK: libopenblas
>>>>>>>   LIBM: libopenlibm
>>>>>>>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>>>>>>>
>>>>>>> Here is an example peakflops() exercise:
>>>>>>>
>>>>>>> julia> BLAS.set_num_threads(1)
>>>>>>>
>>>>>>> julia> mean(peakflops(6755) for i=1:10)
>>>>>>> 5.225580459387056e10
>>>>>>>
>>>>>>> julia> BLAS.set_num_threads(2)
>>>>>>>
>>>>>>> julia> mean(peakflops(6755) for i=1:10)
>>>>>>> 1.004317640281997e11
>>>>>>>
>>>>>>> julia> BLAS.set_num_threads(4)
>>>>>>>
>>>>>>> julia> mean(peakflops(6755) for i=1:10)
>>>>>>> 9.838116463900085e10
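Ralph's allocation-free suggestion quoted above can be made self-contained along these lines (a sketch in Julia 0.5-era syntax; n = 6755 matches the peakflops() runs in the thread, and the GFLOPS conversion at the end is my own addition):

```julia
using BenchmarkTools            # Pkg.add("BenchmarkTools") if not installed
n = 6755
a = rand(n, n); b = rand(n, n); c = similar(a)
BLAS.set_num_threads(4)
# $-interpolation keeps the benchmark from treating a, b, c as globals;
# A_mul_B! writes into c, so no result matrix is allocated per run.
trial = @benchmark A_mul_B!($c, $a, $b)
# BenchmarkTools reports times in nanoseconds; convert the best run to
# GFLOPS using the ~2n^3 flop count of a dense matrix multiply.
best_secs = minimum(trial.times) / 1e9
println(2n^3 / best_secs / 1e9, " GFLOPS")
```

If this in-place version still plateaus at 2 threads on the Mac while scaling to 4 on Linux, that would rule out GC as the explanation and point back at threading.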