At least in my experience on a Mac, I've never seen real linear algebra 
code (not just peakflops) in Julia + OpenBLAS saturate more than 2 cores, 
even with the thread count set to 4 on a machine with 4 physical cores. 
 When I try similar code on a Linux machine I have access to, I have no 
trouble saturating as many physical cores as are available, which makes me 
think that somehow the BLAS + threading situation in the mac build of 
Julia is not quite right.  
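For anyone who wants to reproduce this, here is roughly what I've been running 
(a minimal sketch using the 0.5-era in-place A_mul_B! syntax from earlier in 
this thread; the matrix size 4000 is an arbitrary choice for illustration):

```julia
# Compare BLAS scaling across thread counts by timing a fixed,
# preallocated matrix product (avoids allocation/GC effects).
n = 4000
a = rand(n, n); b = rand(n, n); c = similar(a)
for t in (1, 2, 4)
    BLAS.set_num_threads(t)
    A_mul_B!(c, a, b)                  # warm up at this thread count
    sec = @elapsed A_mul_B!(c, a, b)   # time one in-place multiply
    gflops = 2 * n^3 / sec / 1e9       # ~2n^3 flops for an n-by-n gemm
    println("threads = $t: $(round(gflops, 1)) GFLOPS")
end
```

On the Linux box the reported GFLOPS roughly doubles from 1 to 2 and again 
from 2 to 4 threads; on the mac it flattens out after 2.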

On Friday, October 21, 2016 at 10:05:04 PM UTC-5, Ralph Smith wrote:
>
> On looking more carefully, I believe I was mistaken about thread 
> assignment to cores - that seems to be done well in OpenBLAS (and maybe 
> Linux in general nowadays).  Perhaps the erratic benchmarks under 
> hyperthreading - even after heap management is tamed - arise when the 
> operating system detects idle virtual cores and schedules disruptive 
> processes there.
>
> On Friday, October 21, 2016 at 12:09:07 AM UTC-4, Ralph Smith wrote:
>>
>> That's interesting, I see the code in OpenBLAS. However, on the Linux 
>> systems I use, when I had hyperthreading enabled the allocations looked 
>> random, and I generally got less consistent benchmarks.  I'll have to check 
>> that again.
>>
>> You can also avoid the memory-allocation effects with something like:
>> using BenchmarkTools
>> n = 6755; a = rand(n,n); b = rand(n,n); c = similar(a)
>> @benchmark A_mul_B!($c,$a,$b)
>>
>> Of course this is only directly relevant to your real workload if that is 
>> dominated by sections where you can optimize away allocations and memory 
>> latency.
>>
>>
>> On Thursday, October 20, 2016 at 11:00:41 PM UTC-4, Thomas Covert wrote:
>>>
>>> Thanks - I will try to figure out how to do that.  I will note, however, 
>>> that the OpenBLAS FAQ suggests that OpenBLAS tries to avoid allocating 
>>> threads to the same physical core on machines with hyperthreading, so 
>>> perhaps this is not the cause:
>>>
>>> https://github.com/xianyi/OpenBLAS/blob/master/GotoBLAS_03FAQ.txt
>>>
>>>
>>>
>>> On Thursday, October 20, 2016 at 4:45:51 PM UTC-5, Stefan Karpinski 
>>> wrote:
>>>>
>>>> I think Ralph is suggesting that you disable the CPU's hyperthreading 
>>>> if you run this kind of code often. We've done that on our benchmarking 
>>>> machines, for example.
>>>>
>>>> On Wed, Oct 19, 2016 at 11:47 PM, Thomas Covert <thom....@gmail.com> 
>>>> wrote:
>>>>
>>>>> So are you suggesting that real numerical workloads under 
>>>>> BLAS.set_num_threads(4) will indeed be faster than under 
>>>>> BLAS.set_num_threads(2)?  That hasn't been my experience, and I 
>>>>> figured the peakflops() example would constitute an MWE.  Is there 
>>>>> another workload you would suggest I try, to figure out whether this 
>>>>> is just a peakflops() aberration or a real issue?
>>>>>
>>>>>
>>>>> On Wednesday, October 19, 2016 at 8:28:16 PM UTC-5, Ralph Smith wrote:
>>>>>>
>>>>>> At least two things contribute to erratic results from peakflops(). 
>>>>>> First, with hyperthreading the threads are not always put on separate 
>>>>>> physical cores.  Second, the measured time includes the allocation of 
>>>>>> the result matrix, so garbage collection affects some of the results. 
>>>>>> Most available advice says to disable hyperthreading on dedicated 
>>>>>> number crunchers (most full loads run slightly more efficiently 
>>>>>> without the extra context switching).  The GC issue seems like a 
>>>>>> mistake, if "peak" is to be taken seriously.
>>>>>>
>>>>>> On Wednesday, October 19, 2016 at 12:04:00 PM UTC-4, Thomas Covert 
>>>>>> wrote:
>>>>>>>
>>>>>>> I have a recent iMac with 4 physical cores (8 logical cores with 
>>>>>>> hyperthreading).  I would have thought that peakflops(N) for large 
>>>>>>> enough N should be increasing in the number of threads I allow BLAS 
>>>>>>> to use.  I do find that peakflops(N) with 1 thread is about half as 
>>>>>>> high as with 2 threads, but there is no gain from 4 threads.  Are my 
>>>>>>> expectations wrong here, or is it possible that BLAS is somehow 
>>>>>>> configured incorrectly on my machine?  In the example below, 
>>>>>>> N = 6755, a number relevant for my work, but the results are similar 
>>>>>>> with 5000 or 10000.
>>>>>>>
>>>>>>> here is my versioninfo()
>>>>>>> julia> versioninfo()
>>>>>>> Julia Version 0.5.0
>>>>>>> Commit 3c9d753* (2016-09-19 18:14 UTC)
>>>>>>> Platform Info:
>>>>>>>   System: Darwin (x86_64-apple-darwin15.6.0)
>>>>>>>   CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
>>>>>>>   WORD_SIZE: 64
>>>>>>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>>>>>>>   LAPACK: libopenblas
>>>>>>>   LIBM: libopenlibm
>>>>>>>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>>>>>>>
>>>>>>> here is an example peakflops() exercise:
>>>>>>> julia> BLAS.set_num_threads(1)
>>>>>>>
>>>>>>> julia> mean(peakflops(6755) for i=1:10)
>>>>>>> 5.225580459387056e10
>>>>>>>
>>>>>>> julia> BLAS.set_num_threads(2)
>>>>>>>
>>>>>>> julia> mean(peakflops(6755) for i=1:10)
>>>>>>> 1.004317640281997e11
>>>>>>>
>>>>>>> julia> BLAS.set_num_threads(4)
>>>>>>>
>>>>>>> julia> mean(peakflops(6755) for i=1:10)
>>>>>>> 9.838116463900085e10
