Argh - yeah, I got confused as things context-switched a few too many times. The 1.8.2 release should certainly understand that arrangement, and --hetero-nodes. The only way it wouldn't see the latter is if you configured it --without-hwloc, or hwloc refused to build.
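FWIW, a quick way to sanity-check all of that on one of the nodes (a rough sketch - it assumes a CentOS-style install and that the ompi_info/mpirun found in your PATH are the 1.8.2 ones):

    ompi_info | grep -i hwloc        # an hwloc component should appear if hwloc was built in
    mpirun --help | grep -i hetero   # --hetero-nodes should be listed if the build supports it
    rpm -q numactl numactl-devel     # numactl-devel must be present when Open MPI is configured/built

If the first two come back empty, that points straight at the hwloc/numactl-devel issue below.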
Since there was a question about the numactl-devel requirement, I suspect that is the root cause of all evil in this case, and the lack of --hetero-nodes would confirm that diagnosis :-)

On Sep 2, 2014, at 1:49 PM, Lane, William <william.l...@cshs.org> wrote:

> Ralph,
>
> These latest issues (since 8/28/14) all occurred after we upgraded our cluster to OpenMPI 1.8.2 on . Maybe I should've created a new thread rather than tacking on these issues to my existing thread.
>
> -Bill Lane
>
> ________________________________________
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> Sent: Tuesday, September 02, 2014 11:03 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)
>
> On Sep 2, 2014, at 10:48 AM, Lane, William <william.l...@cshs.org> wrote:
>
>> Ralph,
>>
>> There are at least three different permutations of CPU configurations in the cluster involved. Some are blades that have two sockets with two cores per Intel CPU (and not all sockets are filled). Some are IBM x3550 systems having two sockets with three cores per Intel CPU (and not all sockets are populated). All nodes have hyperthreading turned on as well.
>>
>> I will look into getting the numactl-devel package installed.
>>
>> I will try the --bind-to none switch again. For some reason the --hetero-nodes switch wasn't recognized by mpirun. Is the --hetero-nodes switch an MCA parameter?
>
> My bad - I forgot that you are using a very old OMPI version. I think you'll need to upgrade, though, as I don't believe something that old will know how to handle such a hybrid system. I suspect this may be at the bottom of the problem you are seeing.
>
> You'll really need to get up to the 1.8 series, I'm afraid - I'm not sure even 1.6 can handle this setup.
>
>> Thanks for your help.
>>
>> -Bill Lane
>> ________________________________________
>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>> Sent: Saturday, August 30, 2014 7:15 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)
>>
>> hwloc requires the numactl-devel package in addition to the numactl one.
>>
>> If I understand the email thread correctly, it sounds like you have at least some nodes in your system that have fewer cores than others - is that correct?
>>
>>>> Here are the definitions of the two parallel environments tested (with orte always failing when more slots are requested than there are CPU cores on the first node allocated to the job by SGE):
>>
>> If that is the situation, then you need to add --hetero-nodes to your cmd line so we look at the actual topology of every node. Otherwise, for scalability reasons, we only look at the first node in the allocation and assume all nodes are the same.
>>
>> If that isn't the case, then it sounds like we are seeing fewer cores than exist on the system for some reason. You could try installing hwloc independently, and then running "lstopo" to find out what it detects. Another thing you could do is add "-mca plm_base_verbose 100" to your cmd line (I suggest doing that with just a couple of nodes in your allocation) and that will dump the detected topology to stderr.
>>
>> I'm surprised the bind-to none option didn't remove the error - it definitely should, as we won't be binding when that is given. However, I note that you misspelled it in your reply, so maybe you just didn't type it correctly? It is "--bind-to none" - note the space between the "to" and the "none". You'll take a performance hit, but it should at least run.
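To make the quoted suggestion concrete, on a couple of the suspect nodes that would look roughly like this (a sketch only - "hostname" is just a harmless stand-in payload and "-np 2" is arbitrary):

    lstopo                                                           # what hwloc detects on this node
    mpirun -np 2 --hetero-nodes -mca plm_base_verbose 100 hostname   # per the note above, dumps the detected topology to stderr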
>> On Aug 29, 2014, at 11:29 PM, Lane, William <william.l...@cshs.org> wrote:
>>
>>> The --bind-to-none switch didn't help, I'm still getting the same errors.
>>>
>>> The only NUMA package installed on the nodes in this CentOS 6.2 cluster is the following:
>>>
>>> numactl-2.0.7-3.el6.x86_64
>>> This package is described as: numactl.x86_64 : Library for tuning for Non Uniform Memory Access machines
>>>
>>> Since many of these systems are NUMA systems (with separate memory address spaces for the sockets), could it be that the correct NUMA libraries aren't installed?
>>>
>>> Here are some of the other NUMA packages available for CentOS 6.x:
>>>
>>> yum search numa | less
>>>
>>> Loaded plugins: fastestmirror
>>> Loading mirror speeds from cached hostfile
>>> ============================== N/S Matched: numa ===============================
>>> numactl-devel.i686 : Development package for building Applications that use numa
>>> numactl-devel.x86_64 : Development package for building Applications that use numa
>>> numad.x86_64 : NUMA user daemon
>>> numactl.i686 : Library for tuning for Non Uniform Memory Access machines
>>> numactl.x86_64 : Library for tuning for Non Uniform Memory Access machines
>>>
>>> -Bill Lane
>>> ________________________________________
>>> From: users [users-boun...@open-mpi.org] on behalf of Reuti [re...@staff.uni-marburg.de]
>>> Sent: Thursday, August 28, 2014 3:27 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)
>>>
>>> On 28.08.2014, at 10:09, Lane, William wrote:
>>>
>>>> I have some updates on these issues and some test results as well.
>>>>
>>>> We upgraded OpenMPI to the latest version, 1.8.2, but when submitting jobs via the SGE orte parallel environment we received errors whenever more slots are requested than there are actual cores on the first node allocated to the job.
>>>
>>> Does "-bind-to none" help? Binding is switched on by default from Open MPI 1.8 onwards.
>>>
>>>> The btl tcp,self switch passed to mpirun made a significant difference in performance, as per the below:
>>>>
>>>> Even with the oversubscribe option, the memory-mapping errors still persist. On 32-core nodes, with HPL compiled for openmpi/1.8.2, it reliably starts failing at 20 cores allocated. Note that I tested with 'btl tcp,self' defined and it slows a quick solve down by roughly 2x; the results on a larger solve would probably be more dramatic:
>>>> - Quick HPL 16 core with SM: ~19 GFlops
>>>> - Quick HPL 16 core without SM: ~10 GFlops
>>>>
>>>> Unfortunately, a recompiled HPL did not work, but it did give us more information (error below). Still trying a couple of things.
>>>>
>>>> A request was made to bind to that would result in binding more processes than cpus on a resource:
>>>>
>>>>    Bind to:     CORE
>>>>    Node:        csclprd3-0-7
>>>>    #processes:  2
>>>>    #cpus:       1
>>>>
>>>> You can override this protection by adding the "overload-allowed" option to your binding directive.
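Spelling out the 1.8-style syntax being discussed above (a sketch - "./xhpl" and the process counts are placeholders, and the exact qualifier spelling is worth confirming against `mpirun --help` on your build):

    mpirun -np 32 --bind-to none ./xhpl                                  # no binding at all (note the space in "--bind-to none")
    mpirun -np 32 --map-by core --bind-to core:overload-allowed ./xhpl   # permit binding more processes than cores, per the hint above
    mpirun -np 32 --oversubscribe ./xhpl                                 # permit more processes than allocated slots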
>>>> When using the SGE make parallel environment to submit jobs, everything worked perfectly.
>>>>
>>>> I noticed that when using the make PE, the number of slots allocated from each node to the job corresponded to the number of CPUs, and disregarded any additional cores within a CPU and any hyperthreading cores.
>>>
>>> For SGE the hyperthreading cores count as normal cores. In principle it's possible to have an RQS defined in SGE (`qconf -srqsl`) which will limit the number of cores for the "make" PE, or (better) to limit it in each exechost definition to the physically installed ones (this is what I usually set up - maybe leaving hyperthreading switched on gives some room for the kernel processes this way).
>>>
>>>> Here are the definitions of the two parallel environments tested (with orte always failing when more slots are requested than there are CPU cores on the first node allocated to the job by SGE):
>>>>
>>>> [root@csclprd3 ~]# qconf -sp orte
>>>> pe_name            orte
>>>> slots              9999
>>>> user_lists         NONE
>>>> xuser_lists        NONE
>>>> start_proc_args    /bin/true
>>>> stop_proc_args     /bin/true
>>>> allocation_rule    $fill_up
>>>> control_slaves     TRUE
>>>> job_is_first_task  FALSE
>>>> urgency_slots      min
>>>> accounting_summary TRUE
>>>> qsort_args         NONE
>>>>
>>>> [root@csclprd3 ~]# qconf -sp make
>>>> pe_name            make
>>>> slots              999
>>>> user_lists         NONE
>>>> xuser_lists        NONE
>>>> start_proc_args    NONE
>>>> stop_proc_args     NONE
>>>> allocation_rule    $round_robin
>>>> control_slaves     TRUE
>>>> job_is_first_task  FALSE
>>>> urgency_slots      min
>>>> accounting_summary TRUE
>>>> qsort_args         NONE
>>>>
>>>> Although everything seems to work with the make PE, I'd still like to know why, because on a much older version of OpenMPI loaded on an older version of CentOS, SGE and ROCKS, using all physical cores as well as all hyperthreads was never a problem (even on NUMA nodes).
>>>>
>>>> What is the recommended SGE parallel environment definition for OpenMPI 1.8.2?
>>>
>>> Whether you prefer $fill_up or $round_robin is up to you - do you want all your processes on the fewest machines, or spread around the cluster? If there is a lot of communication it may be better to use fewer machines, but if each process does heavy I/O to the local scratch disk, spreading them around may be the preferred choice. This doesn't make any difference to Open MPI, as the generated $PE_HOSTFILE contains just the list of granted slots. Doing it $fill_up style will of course fill the first node, including the hyperthreading cores, before moving to the next machine (`man sge_pe`).
>>>
>>> -- Reuti
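As one concrete way of doing what Reuti describes - capping each execution host at its physical core count - something along these lines should work (a hedged sketch: the host name is borrowed from the error above, the slot count of 8 is only an example, and the non-interactive form should be double-checked against `man qconf`):

    qconf -me csclprd3-0-7                                      # interactively set:  complex_values  slots=8
    qconf -mattr exechost complex_values slots=8 csclprd3-0-7   # the same change, non-interactively

With that in place, even a $fill_up PE should no longer hand out more slots on a node than it has physical cores.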
>>>> I apologize for the length of this, but I thought it best to provide more information rather than less.
>>>>
>>>> Thank you in advance,
>>>>
>>>> -Bill Lane
>>>>
>>>> ________________________________________
>>>> From: users [users-boun...@open-mpi.org] on behalf of Jeff Squyres (jsquyres) [jsquy...@cisco.com]
>>>> Sent: Friday, August 08, 2014 5:25 AM
>>>> To: Open MPI User's List
>>>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots
>>>>
>>>> On Aug 8, 2014, at 1:24 AM, Lane, William <william.l...@cshs.org> wrote:
>>>>
>>>>> Using the "--mca btl tcp,self" switch to mpirun solved all the issues (in addition to the requirement to include the --mca btl_tcp_if_include eth0 switch). I believe the "--mca btl tcp,self" switch limits inter-process communication within a node to the TCP loopback rather than shared memory.
>>>>
>>>> Correct. You will not be using shared memory for MPI communication at all -- just TCP.
>>>>
>>>>> I should also point out that all of the nodes on this cluster feature a NUMA architecture.
>>>>>
>>>>> Will using the "--mca btl tcp,self" switch to mpirun result in any degraded performance compared to using shared memory?
>>>>
>>>> Generally yes, but it depends on your application. If your application does very little MPI communication, then the difference between shared memory and TCP is likely negligible.
>>>>
>>>> I'd strongly suggest two things:
>>>>
>>>> - Upgrade to at least Open MPI 1.6.5 (1.8.x would be better, if possible)
>>>> - Run your program through a memory-checking debugger such as Valgrind
>>>>
>>>> Seg faults like you initially described can be caused by errors in your MPI application itself -- the fact that using TCP only (and not shared memory) avoids the segvs does not mean that the issue is actually fixed; it may well mean that the error is still there, but is happening in a case that doesn't seem to cause enough damage to trigger a segv.
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
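For completeness, the command lines Jeff's advice points at would look something like this ("./your_app" and the process counts are placeholders):

    mpirun -np 16 --mca btl tcp,self --mca btl_tcp_if_include eth0 ./your_app    # the TCP-only configuration that currently works
    mpirun -np 4 valgrind --leak-check=full --track-origins=yes ./your_app       # memory-check the application itself

If Valgrind flags invalid reads or writes in the application, that would support the theory that the shared-memory path is merely exposing an existing bug rather than causing it.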