Lane --

Can you confirm that adding numactl-devel and using --hetero-nodes fixed your 
problem?



On Sep 2, 2014, at 5:08 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Argh - yeah, I got confused as things context switched a few too many times. 
> The 1.8.2 release should certainly understand that arrangement, and 
> --hetero-nodes. The only way it wouldn't see the latter is if you configured 
> it with --without-hwloc, or hwloc refused to build.
> 
> Since there was a question about the numactl-devel requirement, I suspect 
> that is the root cause of all evil in this case and the lack of 
> --hetero-nodes would confirm that diagnosis :-)
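> 
> A quick way to check both (just a sketch - adjust to however your 1.8.2 was 
> installed) would be something like:
> 
>     ompi_info | grep -i hwloc
>     mpirun --help | grep hetero-nodes
> 
> If the first shows no hwloc component at all, the build almost certainly went 
> in without hwloc support.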
> 
> 
> 
> On Sep 2, 2014, at 1:49 PM, Lane, William <william.l...@cshs.org> wrote:
> 
>> Ralph,
>> 
>> These latest issues (since 8/28/14) all occurred after we upgraded our 
>> cluster
>> to OpenMPI 1.8.2. Maybe I should've created a new thread rather
>> than tacking on these issues to my existing thread.
>> 
>> -Bill Lane
>> 
>> ________________________________________
>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
>> [r...@open-mpi.org]
>> Sent: Tuesday, September 02, 2014 11:03 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots 
>> (updated findings)
>> 
>> On Sep 2, 2014, at 10:48 AM, Lane, William <william.l...@cshs.org> wrote:
>> 
>>> Ralph,
>>> 
>>> There are at least three different permutations of CPU configurations in 
>>> the cluster
>>> involved. Some are blades that have two sockets with two cores per Intel 
>>> CPU (and not all
>>> sockets are filled). Some are IBM x3550 systems having two sockets with 
>>> three cores
>>> per Intel CPU (and not all sockets are populated). All nodes have 
>>> hyperthreading turned
>>> on as well.
>>> 
>>> I will look into getting the numactl-devel package installed.
>>> 
>>> I will try the --bind-to none switch again. For some reason the 
>>> --hetero-nodes switch wasn't
>>> recognized by mpirun. Is the --hetero-nodes switch an MCA parameter?
>> 
>> My bad - I forgot that you are using a very old OMPI version. I think you'll 
>> need to upgrade, though, as I don't believe something that old will know how 
>> to handle such a hybrid system. I suspect this may be at the bottom of the 
>> problem you are seeing.
>> 
>> You'll really need to get up to the 1.8 series, I'm afraid - I'm not sure 
>> even 1.6 can handle this setup.
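>> 
>> (To double-check which installation is actually being picked up on the 
>> nodes, something along these lines should do:)
>> 
>>     which mpirun
>>     mpirun --version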
>> 
>>> 
>>> Thanks for your help.
>>> 
>>> -Bill Lane
>>> ________________________________________
>>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
>>> [r...@open-mpi.org]
>>> Sent: Saturday, August 30, 2014 7:15 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots 
>>> (updated findings)
>>> 
>>> hwloc requires the numactl-devel package in addition to the numactl one
>>> 
>>> If I understand the email thread correctly, it sounds like you have at 
>>> least some nodes in your system that have fewer cores than others - is that 
>>> correct?
>>> 
>>>>> Here are the definitions of the two parallel environments tested (with 
>>>>> orte always failing when
>>>>> more slots are requested than there are CPU cores on the first node 
>>>>> allocated to the job by
>>>>> SGE):
>>> 
>>> If that is the situation, then you need to add --hetero-nodes to your cmd 
>>> line so we look at the actual topology of every node. Otherwise, for 
>>> scalability reasons, we only look at the first node in the allocation and 
>>> assume all nodes are the same.
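>>> 
>>> For example (a sketch - the process count and binary name are just 
>>> placeholders):
>>> 
>>>     mpirun --hetero-nodes -np 32 ./my_mpi_app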
>>> 
>>> If that isn't the case, then it sounds like we are seeing fewer cores than 
>>> exist on the system for some reason. You could try installing hwloc 
>>> independently, and then running "lstopo" to find out what it detects. 
>>> Another thing you could do is add "-mca plm_base_verbose 100" to your cmd 
>>> line (I suggest doing that with just a couple of nodes in your allocation) 
>>> and that will dump the detected topology to stderr.
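>>> 
>>> Something along these lines, say (process count only illustrative):
>>> 
>>>     lstopo
>>>     mpirun -mca plm_base_verbose 100 -np 4 hostname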
>>> 
>>> I'm surprised the bind-to none option didn't remove the error - it 
>>> definitely should as we won't be binding when that is given. However, I 
>>> note that you misspelled it in your reply, so maybe you just didn't type it 
>>> correctly? It is "--bind-to none" - note the space between the "to" and the 
>>> "none". You'll take a performance hit, but it should at least run.
>>> 
>>> 
>>> 
>>> On Aug 29, 2014, at 11:29 PM, Lane, William <william.l...@cshs.org> wrote:
>>> 
>>>> The --bind-to-none switch didn't help, I'm still getting the same errors.
>>>> 
>>>> The only NUMA package installed on the nodes in this CentOS 6.2 cluster is 
>>>> the
>>>> following:
>>>> 
>>>> numactl-2.0.7-3.el6.x86_64
>>>> This package is described as: numactl.x86_64 : Library for tuning for Non 
>>>> Uniform Memory Access machines
>>>> 
>>>> Since many of these systems are NUMA systems (with separate memory address 
>>>> spaces for the sockets), could it be that the correct NUMA libraries aren't 
>>>> installed?
>>>> 
>>>> Here are some of the other NUMA packages available for CentOS 6.x:
>>>> 
>>>> yum search numa | less
>>>> 
>>>>             Loaded plugins: fastestmirror
>>>>             Loading mirror speeds from cached hostfile
>>>>             ============================== N/S Matched: numa 
>>>> ===============================
>>>>             numactl-devel.i686 : Development package for building 
>>>> Applications that use numa
>>>>             numactl-devel.x86_64 : Development package for building 
>>>> Applications that use
>>>>                                  : numa
>>>>             numad.x86_64 : NUMA user daemon
>>>>             numactl.i686 : Library for tuning for Non Uniform Memory 
>>>> Access machines
>>>>             numactl.x86_64 : Library for tuning for Non Uniform Memory 
>>>> Access machines
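>>>> 
>>>> (If numactl-devel does turn out to be the missing piece, I assume the fix 
>>>> would be roughly the following - the install prefix is just an example:)
>>>> 
>>>>     yum install numactl-devel
>>>>     cd openmpi-1.8.2 && ./configure --prefix=/opt/openmpi-1.8.2 && make all install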
>>>> 
>>>> -Bill Lane
>>>> ________________________________________
>>>> From: users [users-boun...@open-mpi.org] on behalf of Reuti 
>>>> [re...@staff.uni-marburg.de]
>>>> Sent: Thursday, August 28, 2014 3:27 AM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots 
>>>> (updated findings)
>>>> 
>>>> Am 28.08.2014 um 10:09 schrieb Lane, William:
>>>> 
>>>>> I have some updates on these issues and some test results as well.
>>>>> 
>>>>> We upgraded OpenMPI to the latest version, 1.8.2, but when submitting jobs 
>>>>> via the SGE orte parallel environment we received errors whenever more 
>>>>> slots were requested than there are actual cores on the first node 
>>>>> allocated to the job.
>>>> 
>>>> Does "-bind-to none" help? The binding is switched on by default in Open 
>>>> MPI 1.8 onwards.
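>>>> 
>>>> E.g. (only a sketch; process count and binary name are placeholders):
>>>> 
>>>>     mpirun --bind-to none -np 32 ./my_mpi_app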
>>>> 
>>>> 
>>>>> The btl tcp,self switch passed to mpirun made a significant difference in 
>>>>> performance, as shown below:
>>>>> 
>>>>> Even with the oversubscribe option, the memory mapping errors still 
>>>>> persist. On 32-core nodes, and with an HPL run compiled for openmpi/1.8.2, 
>>>>> it reliably starts failing at 20 cores allocated. Note that I tested with 
>>>>> 'btl tcp,self' defined and it slows down a quick solve by roughly a factor 
>>>>> of 2. The results on a larger solve would probably be more dramatic:
>>>>> - Quick HPL 16 core with SM: ~19GFlops
>>>>> - Quick HPL 16 core without SM: ~10GFlops
>>>>> 
>>>>> Unfortunately, a recompiled HPL did not work, but it did give us more 
>>>>> information (error below). Still trying a couple things.
>>>>> 
>>>>> A request was made to bind to that would result in binding more
>>>>> processes than cpus on a resource:
>>>>> 
>>>>> Bind to:     CORE
>>>>> Node:        csclprd3-0-7
>>>>> #processes:  2
>>>>> #cpus:       1
>>>>> 
>>>>> You can override this protection by adding the "overload-allowed"
>>>>> option to your binding directive.
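>>>>> 
>>>>> (I assume that would look roughly like the following, though I haven't 
>>>>> tried it yet - the binary name is just a placeholder:)
>>>>> 
>>>>>     mpirun --bind-to core:overload-allowed -np 32 ./xhpl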
>>>>> 
>>>>> When using the SGE make parallel environment to submit jobs, everything 
>>>>> worked perfectly. I noticed that when using the make PE, the number of 
>>>>> slots allocated from each node to the job corresponded to the number of 
>>>>> CPUs and disregarded any additional cores within a CPU and any 
>>>>> hyperthreading cores.
>>>> 
>>>> For SGE the hyperthreading cores count as normal cores. In principle it's 
>>>> possible to have an RQS defined in SGE (`qconf -srqsl`) which will limit 
>>>> the number of cores for the "make" PE, or (better) limit it in each 
>>>> exechost definition to the physically installed ones (this is what I 
>>>> usually set up - maybe leaving hyperthreading switched on gives some room 
>>>> for the kernel processes this way).
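>>>> 
>>>> (As a sketch of the exechost variant - the hostname and the count of 12 
>>>> physical cores are just examples:)
>>>> 
>>>>     qconf -me csclprd3-0-7
>>>>     # then in the editor set, per host:
>>>>     #   complex_values   slots=12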
>>>> 
>>>> 
>>>>> Here are the definitions of the two parallel environments tested (with 
>>>>> orte always failing when
>>>>> more slots are requested than there are CPU cores on the first node 
>>>>> allocated to the job by
>>>>> SGE):
>>>>> 
>>>>> [root@csclprd3 ~]# qconf -sp orte
>>>>> pe_name            orte
>>>>> slots              9999
>>>>> user_lists         NONE
>>>>> xuser_lists        NONE
>>>>> start_proc_args    /bin/true
>>>>> stop_proc_args     /bin/true
>>>>> allocation_rule    $fill_up
>>>>> control_slaves     TRUE
>>>>> job_is_first_task  FALSE
>>>>> urgency_slots      min
>>>>> accounting_summary TRUE
>>>>> qsort_args         NONE
>>>>> 
>>>>> [root@csclprd3 ~]# qconf -sp make
>>>>> pe_name            make
>>>>> slots              999
>>>>> user_lists         NONE
>>>>> xuser_lists        NONE
>>>>> start_proc_args    NONE
>>>>> stop_proc_args     NONE
>>>>> allocation_rule    $round_robin
>>>>> control_slaves     TRUE
>>>>> job_is_first_task  FALSE
>>>>> urgency_slots      min
>>>>> accounting_summary TRUE
>>>>> qsort_args         NONE
>>>>> 
>>>>> Although everything seems to work with the make PE, I'd still like
>>>>> to know why, because on a much older version of OpenMPI running
>>>>> on older versions of CentOS, SGE and ROCKS, using all physical
>>>>> cores as well as all hyperthreads was never a problem (even on NUMA
>>>>> nodes).
>>>>> 
>>>>> What is the recommended SGE parallel environment definition for
>>>>> OpenMPI 1.8.2?
>>>> 
>>>> Whether you prefer $fill_up or $round_robin is up to you - it depends on 
>>>> whether you prefer all your processes packed onto the fewest machines or 
>>>> spread around the cluster. If there is much communication it may be better 
>>>> on fewer machines, but if each process has heavy I/O to the local scratch 
>>>> disk, spreading the job around may be the preferred choice. This doesn't 
>>>> make any difference to Open MPI, as the generated $PE_HOSTFILE contains 
>>>> just the list of granted slots. Doing it in $fill_up style will of course 
>>>> fill the first node, including the hyperthreading cores, before moving to 
>>>> the next machine (`man sge_pe`).
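>>>> 
>>>> (For reference, a $PE_HOSTFILE from an $fill_up allocation looks roughly 
>>>> like this - hostnames, queue name and slot counts are only illustrative:)
>>>> 
>>>>     csclprd3-0-7 16 all.q@csclprd3-0-7 UNDEFINED
>>>>     csclprd3-0-8 4 all.q@csclprd3-0-8 UNDEFINED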
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> I apologize for the length of this, but I thought it best to provide more
>>>>> information than less.
>>>>> 
>>>>> Thank you in advance,
>>>>> 
>>>>> -Bill Lane
>>>>> 
>>>>> ________________________________________
>>>>> From: users [users-boun...@open-mpi.org] on behalf of Jeff Squyres 
>>>>> (jsquyres) [jsquy...@cisco.com]
>>>>> Sent: Friday, August 08, 2014 5:25 AM
>>>>> To: Open MPI User's List
>>>>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots
>>>>> 
>>>>> On Aug 8, 2014, at 1:24 AM, Lane, William <william.l...@cshs.org> wrote:
>>>>> 
>>>>>> Using the "--mca btl tcp,self" switch to mpirun solved all the issues 
>>>>>> (in addition to
>>>>>> the requirement to include the --mca btl_tcp_if_include eth0 switch). I 
>>>>>> believe
>>>>>> the "--mca btl tcp,self" switch limits inter-process communication 
>>>>>> within a node to using the TCP
>>>>>> loopback rather than shared memory.
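>>>>>> 
>>>>>> (For reference, the command line being described is of this general form 
>>>>>> - the process count and binary name are just placeholders:)
>>>>>> 
>>>>>>     mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -np 32 ./my_mpi_app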
>>>>> 
>>>>> Correct.  You will not be using shared memory for MPI communication at 
>>>>> all -- just TCP.
>>>>> 
>>>>>> I should also point out that all of the nodes
>>>>>> on this cluster feature NUMA architecture.
>>>>>> 
>>>>>> Will using the "--mca btl tcp,self" switch to mpirun result in any 
>>>>>> degraded performance compared to using shared memory?
>>>>> 
>>>>> Generally yes, but it depends on your application.  If your application 
>>>>> does very little MPI communication, then the difference between shared 
>>>>> memory and TCP is likely negligible.
>>>>> 
>>>>> I'd strongly suggest two things:
>>>>> 
>>>>> - Upgrade to at least Open MPI 1.6.5 (1.8.x would be better, if possible)
>>>>> - Run your program through a memory-checking debugger such as Valgrind
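>>>>> 
>>>>> (For the second item, a typical invocation - binary name and process 
>>>>> count are placeholders - would be something like:)
>>>>> 
>>>>>     mpirun -np 4 valgrind --leak-check=full ./my_mpi_app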
>>>>> 
>>>>> Seg faults like you initially described can be caused by errors in your 
>>>>> MPI application itself -- the fact that using TCP only (and not shared 
>>>>> memory) avoids the segvs does not mean that the issue is actually fixed; 
>>>>> it may well mean that the error is still there, but is happening in a 
>>>>> case that doesn't seem to cause enough damage to cause a segv.
>>>>> 
>>>>> --
>>>>> Jeff Squyres
>>>>> jsquy...@cisco.com
>>>>> For corporate legal information go to: 
>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>> 
>>>> 
>>> 
>> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
