Argh - yeah, I got confused as things context-switched a few too many times. The 1.8.2 release should certainly understand that arrangement, and --hetero-nodes. The only way it wouldn't see the latter is if you configured it --without-hwloc, or hwloc refused to build.
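FWIW, a quick way to sanity-check all of that on one of the nodes (a rough sketch - it assumes a CentOS-style install and that the ompi_info/mpirun found in your PATH are the 1.8.2 ones):

    ompi_info | grep -i hwloc        # an hwloc component should appear if hwloc was built in
    mpirun --help | grep -i hetero   # --hetero-nodes should be listed if the build supports it
    rpm -q numactl numactl-devel     # numactl-devel must be present when Open MPI is configured/built

If the first two come back empty, that points straight at the hwloc/numactl-devel issue below.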
Since there was a question about the numactl-devel requirement, I suspect that is the root cause of all evil in this case, and the lack of --hetero-nodes would confirm that diagnosis :-)

On Sep 2, 2014, at 1:49 PM, Lane, William <william.l...@cshs.org> wrote:

> Ralph,
>
> These latest issues (since 8/28/14) all occurred after we upgraded our cluster to OpenMPI 1.8.2 on . Maybe I should've created a new thread rather than tacking on these issues to my existing thread.
>
> -Bill Lane
>
> ________________________________________
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> Sent: Tuesday, September 02, 2014 11:03 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)
>
> On Sep 2, 2014, at 10:48 AM, Lane, William <william.l...@cshs.org> wrote:
>
>> Ralph,
>>
>> There are at least three different permutations of CPU configurations in the cluster involved. Some are blades that have two sockets with two cores per Intel CPU (and not all sockets are filled). Some are IBM x3550 systems having two sockets with three cores per Intel CPU (and not all sockets are populated). All nodes have hyperthreading turned on as well.
>>
>> I will look into getting the numactl-devel package installed.
>>
>> I will try the --bind-to none switch again. For some reason the --hetero-nodes switch wasn't recognized by mpirun. Is the --hetero-nodes switch an MCA parameter?
>
> My bad - I forgot that you are using a very old OMPI version. I think you'll need to upgrade, though, as I don't believe something that old will know how to handle such a hybrid system. I suspect this may be at the bottom of the problem you are seeing.
>
> You'll really need to get up to the 1.8 series, I'm afraid - I'm not sure even 1.6 can handle this setup.
>
>> Thanks for your help.
>>
>> -Bill Lane
>> ________________________________________
>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>> Sent: Saturday, August 30, 2014 7:15 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)
>>
>> hwloc requires the numactl-devel package in addition to the numactl one.
>>
>> If I understand the email thread correctly, it sounds like you have at least some nodes in your system that have fewer cores than others - is that correct?
>>
>>>> Here are the definitions of the two parallel environments tested (with orte always failing when more slots are requested than there are CPU cores on the first node allocated to the job by SGE):
>>
>> If that is the situation, then you need to add --hetero-nodes to your cmd line so we look at the actual topology of every node. Otherwise, for scalability reasons, we only look at the first node in the allocation and assume all nodes are the same.
>>
>> If that isn't the case, then it sounds like we are seeing fewer cores than exist on the system for some reason. You could try installing hwloc independently, and then running "lstopo" to find out what it detects. Another thing you could do is add "-mca plm_base_verbose 100" to your cmd line (I suggest doing that with just a couple of nodes in your allocation) and that will dump the detected topology to stderr.
>>
>> I'm surprised the bind-to none option didn't remove the error - it definitely should, as we won't be binding when that is given. However, I note that you misspelled it in your reply, so maybe you just didn't type it correctly? It is "--bind-to none" - note the space between the "to" and the "none". You'll take a performance hit, but it should at least run.
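To make the quoted suggestion concrete, on a couple of the suspect nodes that would look roughly like this (a sketch only - "hostname" is just a harmless stand-in payload and "-np 2" is arbitrary):

    lstopo                                                           # what hwloc detects on this node
    mpirun -np 2 --hetero-nodes -mca plm_base_verbose 100 hostname   # per the note above, dumps the detected topology to stderr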
>> On Aug 29, 2014, at 11:29 PM, Lane, William <william.l...@cshs.org> wrote:
>>
>>> The --bind-to-none switch didn't help, I'm still getting the same errors.
>>>
>>> The only NUMA package installed on the nodes in this CentOS 6.2 cluster is the following:
>>>
>>> numactl-2.0.7-3.el6.x86_64
>>> This package is described as: numactl.x86_64 : Library for tuning for Non Uniform Memory Access machines
>>>
>>> Since many of these systems are NUMA systems (with separate memory address spaces for the sockets), could it be that the correct NUMA libraries aren't installed?
>>>
>>> Here are some of the other NUMA packages available for CentOS 6.x:
>>>
>>> yum search numa | less
>>>
>>> Loaded plugins: fastestmirror
>>> Loading mirror speeds from cached hostfile
>>> ============================== N/S Matched: numa ===============================
>>> numactl-devel.i686 : Development package for building Applications that use numa
>>> numactl-devel.x86_64 : Development package for building Applications that use numa
>>> numad.x86_64 : NUMA user daemon
>>> numactl.i686 : Library for tuning for Non Uniform Memory Access machines
>>> numactl.x86_64 : Library for tuning for Non Uniform Memory Access machines
>>>
>>> -Bill Lane
>>> ________________________________________
>>> From: users [users-boun...@open-mpi.org] on behalf of Reuti [re...@staff.uni-marburg.de]
>>> Sent: Thursday, August 28, 2014 3:27 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)
>>>
>>> On 28.08.2014, at 10:09, Lane, William wrote:
>>>
>>>> I have some updates on these issues and some test results as well.
>>>>
>>>> We upgraded OpenMPI to the latest version, 1.8.2, but when submitting jobs via the SGE orte parallel environment we received errors whenever more slots are requested than there are actual cores on the first node allocated to the job.
>>>
>>> Does "-bind-to none" help? Binding is switched on by default from Open MPI 1.8 onwards.
>>>
>>>> The btl tcp,self switch passed to mpirun made a significant difference in performance, as per the below:
>>>>
>>>> Even with the oversubscribe option, the memory-mapping errors still persist. On 32-core nodes, with HPL compiled for openmpi/1.8.2, it reliably starts failing at 20 cores allocated. Note that I tested with 'btl tcp,self' defined and it slows a quick solve down by roughly 2x; the results on a larger solve would probably be more dramatic:
>>>> - Quick HPL 16 core with SM: ~19 GFlops
>>>> - Quick HPL 16 core without SM: ~10 GFlops
>>>>
>>>> Unfortunately, a recompiled HPL did not work, but it did give us more information (error below). Still trying a couple of things.
>>>>
>>>> A request was made to bind to that would result in binding more processes than cpus on a resource:
>>>>
>>>>    Bind to:     CORE
>>>>    Node:        csclprd3-0-7
>>>>    #processes:  2
>>>>    #cpus:       1
>>>>
>>>> You can override this protection by adding the "overload-allowed" option to your binding directive.
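Spelling out the 1.8-style syntax being discussed above (a sketch - "./xhpl" and the process counts are placeholders, and the exact qualifier spelling is worth confirming against `mpirun --help` on your build):

    mpirun -np 32 --bind-to none ./xhpl                                  # no binding at all (note the space in "--bind-to none")
    mpirun -np 32 --map-by core --bind-to core:overload-allowed ./xhpl   # permit binding more processes than cores, per the hint above
    mpirun -np 32 --oversubscribe ./xhpl                                 # permit more processes than allocated slots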
>>>> When using the SGE make parallel environment to submit jobs, everything worked perfectly.
>>>>
>>>> I noticed that when using the make PE, the number of slots allocated from each node to the job corresponded to the number of CPUs, and disregarded any additional cores within a CPU and any hyperthreading cores.
>>>
>>> For SGE the hyperthreading cores count as normal cores. In principle it's possible to have an RQS defined in SGE (`qconf -srqsl`) which will limit the number of cores for the "make" PE, or (better) to limit it in each exechost definition to the physically installed ones (this is what I usually set up - maybe leaving hyperthreading switched on gives some room for the kernel processes this way).
>>>
>>>> Here are the definitions of the two parallel environments tested (with orte always failing when more slots are requested than there are CPU cores on the first node allocated to the job by SGE):
>>>>
>>>> [root@csclprd3 ~]# qconf -sp orte
>>>> pe_name            orte
>>>> slots              9999
>>>> user_lists         NONE
>>>> xuser_lists        NONE
>>>> start_proc_args    /bin/true
>>>> stop_proc_args     /bin/true
>>>> allocation_rule    $fill_up
>>>> control_slaves     TRUE
>>>> job_is_first_task  FALSE
>>>> urgency_slots      min
>>>> accounting_summary TRUE
>>>> qsort_args         NONE
>>>>
>>>> [root@csclprd3 ~]# qconf -sp make
>>>> pe_name            make
>>>> slots              999
>>>> user_lists         NONE
>>>> xuser_lists        NONE
>>>> start_proc_args    NONE
>>>> stop_proc_args     NONE
>>>> allocation_rule    $round_robin
>>>> control_slaves     TRUE
>>>> job_is_first_task  FALSE
>>>> urgency_slots      min
>>>> accounting_summary TRUE
>>>> qsort_args         NONE
>>>>
>>>> Although everything seems to work with the make PE, I'd still like to know why, because on a much older version of OpenMPI loaded on an older version of CentOS, SGE and ROCKS, using all physical cores as well as all hyperthreads was never a problem (even on NUMA nodes).
>>>>
>>>> What is the recommended SGE parallel environment definition for OpenMPI 1.8.2?
>>>
>>> Whether you prefer $fill_up or $round_robin is up to you - do you want all your processes on the fewest machines, or spread around the cluster? If there is a lot of communication it may be better to use fewer machines, but if each process does heavy I/O to the local scratch disk, spreading them around may be the preferred choice. This doesn't make any difference to Open MPI, as the generated $PE_HOSTFILE contains just the list of granted slots. Doing it $fill_up style will of course fill the first node, including the hyperthreading cores, before moving to the next machine (`man sge_pe`).
>>>
>>> -- Reuti
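As one concrete way of doing what Reuti describes - capping each execution host at its physical core count - something along these lines should work (a hedged sketch: the host name is borrowed from the error above, the slot count of 8 is only an example, and the non-interactive form should be double-checked against `man qconf`):

    qconf -me csclprd3-0-7                                      # interactively set:  complex_values  slots=8
    qconf -mattr exechost complex_values slots=8 csclprd3-0-7   # the same change, non-interactively

With that in place, even a $fill_up PE should no longer hand out more slots on a node than it has physical cores.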
>>>> I apologize for the length of this, but I thought it best to provide more information rather than less.
>>>>
>>>> Thank you in advance,
>>>>
>>>> -Bill Lane
>>>>
>>>> ________________________________________
>>>> From: users [users-boun...@open-mpi.org] on behalf of Jeff Squyres (jsquyres) [jsquy...@cisco.com]
>>>> Sent: Friday, August 08, 2014 5:25 AM
>>>> To: Open MPI User's List
>>>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots
>>>>
>>>> On Aug 8, 2014, at 1:24 AM, Lane, William <william.l...@cshs.org> wrote:
>>>>
>>>>> Using the "--mca btl tcp,self" switch to mpirun solved all the issues (in addition to the requirement to include the --mca btl_tcp_if_include eth0 switch). I believe the "--mca btl tcp,self" switch limits inter-process communication within a node to the TCP loopback rather than shared memory.
>>>>
>>>> Correct. You will not be using shared memory for MPI communication at all -- just TCP.
>>>>
>>>>> I should also point out that all of the nodes on this cluster feature a NUMA architecture.
>>>>>
>>>>> Will using the "--mca btl tcp,self" switch to mpirun result in any degraded performance compared to using shared memory?
>>>>
>>>> Generally yes, but it depends on your application. If your application does very little MPI communication, then the difference between shared memory and TCP is likely negligible.
>>>>
>>>> I'd strongly suggest two things:
>>>>
>>>> - Upgrade to at least Open MPI 1.6.5 (1.8.x would be better, if possible)
>>>> - Run your program through a memory-checking debugger such as Valgrind
>>>>
>>>> Seg faults like you initially described can be caused by errors in your MPI application itself -- the fact that using TCP only (and not shared memory) avoids the segvs does not mean that the issue is actually fixed; it may well mean that the error is still there, but is happening in a case that doesn't seem to cause enough damage to trigger a segv.
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
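For completeness, the command lines Jeff's advice points at would look something like this ("./your_app" and the process counts are placeholders):

    mpirun -np 16 --mca btl tcp,self --mca btl_tcp_if_include eth0 ./your_app    # the TCP-only configuration that currently works
    mpirun -np 4 valgrind --leak-check=full --track-origins=yes ./your_app       # memory-check the application itself

If Valgrind flags invalid reads or writes in the application, that would support the theory that the shared-memory path is merely exposing an existing bug rather than causing it.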