Lane -- Can you confirm that adding numactl-devel and using --hetero-nodes fixed your problem?
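A quick way to sanity-check that hwloc support actually got built in (a rough check, assuming a default install; the exact component names in the output vary by version) is something like:

    ompi_info | grep -i hwloc

If nothing shows up, hwloc most likely failed to build.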
On Sep 2, 2014, at 5:08 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Argh - yeah, I got confused as things context switched a few too many times. The 1.8.2 release should certainly understand that arrangement, and --hetero-nodes. The only way it wouldn't see the latter is if you configure it --without-hwloc, or hwloc refused to build.
>
> Since there was a question about the numactl-devel requirement, I suspect that is the root cause of all evil in this case, and the lack of --hetero-nodes would confirm that diagnosis :-)
>
> On Sep 2, 2014, at 1:49 PM, Lane, William <william.l...@cshs.org> wrote:
>
>> Ralph,
>>
>> These latest issues (since 8/28/14) all occurred after we upgraded our cluster to OpenMPI 1.8.2. Maybe I should've created a new thread rather than tacking on these issues to my existing thread.
>>
>> -Bill Lane
>>
>> ________________________________________
>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>> Sent: Tuesday, September 02, 2014 11:03 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)
>>
>> On Sep 2, 2014, at 10:48 AM, Lane, William <william.l...@cshs.org> wrote:
>>
>>> Ralph,
>>>
>>> There are at least three different permutations of CPU configurations in the cluster involved. Some are blades that have two sockets with two cores per Intel CPU (and not all sockets are filled). Some are IBM x3550 systems having two sockets with three cores per Intel CPU (and not all sockets are populated). All nodes have hyperthreading turned on as well.
>>>
>>> I will look into getting the numactl-devel package installed.
>>>
>>> I will try the --bind-to none switch again. For some reason the --hetero-nodes switch wasn't recognized by mpirun. Is the --hetero-nodes switch an MCA parameter?
>>
>> My bad - I forgot that you are using a very old OMPI version. I think you'll need to upgrade, though, as I don't believe something that old will know how to handle such a hybrid system. I suspect this may be at the bottom of the problem you are seeing.
>>
>> You'll really need to get up to the 1.8 series, I'm afraid - I'm not sure even 1.6 can handle this setup.
>>
>>> Thanks for your help.
>>>
>>> -Bill Lane
>>> ________________________________________
>>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>>> Sent: Saturday, August 30, 2014 7:15 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)
>>>
>>> hwloc requires the numactl-devel package in addition to the numactl one.
>>>
>>> If I understand the email thread correctly, it sounds like you have at least some nodes in your system that have fewer cores than others - is that correct?
>>>
>>>>> Here are the definitions of the two parallel environments tested (with orte always failing when more slots are requested than there are CPU cores on the first node allocated to the job by SGE):
>>>
>>> If that is the situation, then you need to add --hetero-nodes to your cmd line so we look at the actual topology of every node. Otherwise, for scalability reasons, we only look at the first node in the allocation and assume all nodes are the same.
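>>>
>>> For example, something along these lines (the process count and app name are just placeholders for your actual job):
>>>
>>>    mpirun --hetero-nodes -np 32 ./your_mpi_app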
>>>
>>> If that isn't the case, then it sounds like we are seeing fewer cores than exist on the system for some reason. You could try installing hwloc independently, and then running "lstopo" to find out what it detects. Another thing you could do is add "-mca plm_base_verbose 100" to your cmd line (I suggest doing that with just a couple of nodes in your allocation) and that will dump the detected topology to stderr.
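>>>
>>> In other words, something like this (the hostnames are placeholders for a couple of nodes from your allocation):
>>>
>>>    lstopo
>>>    mpirun -mca plm_base_verbose 100 --host node1,node2 -np 4 ./your_mpi_app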
>>>
>>> I'm surprised the bind-to none option didn't remove the error - it definitely should, as we won't be binding when that is given. However, I note that you misspelled it in your reply, so maybe you just didn't type it correctly? It is "--bind-to none" - note the space between the "to" and the "none". You'll take a performance hit, but it should at least run.
>>>
>>> On Aug 29, 2014, at 11:29 PM, Lane, William <william.l...@cshs.org> wrote:
>>>
>>>> The --bind-to-none switch didn't help, I'm still getting the same errors.
>>>>
>>>> The only NUMA package installed on the nodes in this CentOS 6.2 cluster is the following:
>>>>
>>>>    numactl-2.0.7-3.el6.x86_64
>>>>
>>>> This package is described as: "numactl.x86_64 : Library for tuning for Non Uniform Memory Access machines".
>>>>
>>>> Since many of these systems are NUMA systems (with separate memory address spaces for the sockets), could it be that the correct NUMA libraries aren't installed?
>>>>
>>>> Here are some of the other NUMA packages available for CentOS 6.x:
>>>>
>>>>    yum search numa | less
>>>>
>>>>    Loaded plugins: fastestmirror
>>>>    Loading mirror speeds from cached hostfile
>>>>    ============================== N/S Matched: numa ===============================
>>>>    numactl-devel.i686 : Development package for building Applications that use numa
>>>>    numactl-devel.x86_64 : Development package for building Applications that use numa
>>>>    numad.x86_64 : NUMA user daemon
>>>>    numactl.i686 : Library for tuning for Non Uniform Memory Access machines
>>>>    numactl.x86_64 : Library for tuning for Non Uniform Memory Access machines
>>>>
>>>> -Bill Lane
>>>> ________________________________________
>>>> From: users [users-boun...@open-mpi.org] on behalf of Reuti [re...@staff.uni-marburg.de]
>>>> Sent: Thursday, August 28, 2014 3:27 AM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)
>>>>
>>>> On 28.08.2014 at 10:09, Lane, William wrote:
>>>>
>>>>> I have some updates on these issues and some test results as well.
>>>>>
>>>>> We upgraded OpenMPI to the latest version, 1.8.2, but when submitting jobs via the SGE orte parallel environment we received errors whenever more slots are requested than there are actual cores on the first node allocated to the job.
>>>>
>>>> Does "-bind-to none" help? Binding is switched on by default from Open MPI 1.8 onwards.
>>>>
>>>>> The btl tcp,self switch passed to mpirun made significant differences in performance, as per the below:
>>>>>
>>>>> Even with the oversubscribe option, the memory mapping errors still persist. On 32-core nodes, with an HPL build compiled against openmpi/1.8.2, it reliably starts failing at 20 cores allocated. Note that I tested with 'btl tcp,self' defined, and it slows a quick solve down by a factor of about 2; the results on a larger solve would probably be more dramatic:
>>>>>
>>>>> - Quick HPL, 16 cores, with SM: ~19 GFlops
>>>>> - Quick HPL, 16 cores, without SM: ~10 GFlops
>>>>>
>>>>> Unfortunately, a recompiled HPL did not work, but it did give us more information (error below). Still trying a couple things.
>>>>>
>>>>>    A request was made to bind to that would result in binding more
>>>>>    processes than cpus on a resource:
>>>>>
>>>>>       Bind to:     CORE
>>>>>       Node:        csclprd3-0-7
>>>>>       #processes:  2
>>>>>       #cpus:       1
>>>>>
>>>>>    You can override this protection by adding the "overload-allowed"
>>>>>    option to your binding directive.
>>>>>
>>>>> When using the SGE make parallel environment to submit jobs, everything worked perfectly. I noticed that when using the make PE, the number of slots allocated from each node to the job corresponded to the number of CPUs, disregarding any additional cores within a CPU and any hyperthreading cores.
>>>>
>>>> For SGE the hyperthreading cores count as normal cores. In principle it's possible to have an RQS defined in SGE (`qconf -srqsl`) which will limit the number of cores for the "make" PE, or (better) limit it in each exechost definition to the physically installed ones (this is what I usually set up - maybe leaving hyperthreading switched on gives some room for the kernel processes this way).
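>>>>
>>>> For instance, a rough and untested sketch of such an RQS (edit via `qconf -mrqs`; the rule name and the slots=8 value are just examples, use the physical core count of your hosts):
>>>>
>>>>    {
>>>>       name     limit_make_to_physical_cores
>>>>       enabled  TRUE
>>>>       limit    pes make hosts {*} to slots=8
>>>>    }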
>>>>
>>>>> Here are the definitions of the two parallel environments tested (with orte always failing when more slots are requested than there are CPU cores on the first node allocated to the job by SGE):
>>>>>
>>>>>    [root@csclprd3 ~]# qconf -sp orte
>>>>>    pe_name            orte
>>>>>    slots              9999
>>>>>    user_lists         NONE
>>>>>    xuser_lists        NONE
>>>>>    start_proc_args    /bin/true
>>>>>    stop_proc_args     /bin/true
>>>>>    allocation_rule    $fill_up
>>>>>    control_slaves     TRUE
>>>>>    job_is_first_task  FALSE
>>>>>    urgency_slots      min
>>>>>    accounting_summary TRUE
>>>>>    qsort_args         NONE
>>>>>
>>>>>    [root@csclprd3 ~]# qconf -sp make
>>>>>    pe_name            make
>>>>>    slots              999
>>>>>    user_lists         NONE
>>>>>    xuser_lists        NONE
>>>>>    start_proc_args    NONE
>>>>>    stop_proc_args     NONE
>>>>>    allocation_rule    $round_robin
>>>>>    control_slaves     TRUE
>>>>>    job_is_first_task  FALSE
>>>>>    urgency_slots      min
>>>>>    accounting_summary TRUE
>>>>>    qsort_args         NONE
>>>>>
>>>>> Although everything seems to work with the make PE, I'd still like to know why, because on a much older version of OpenMPI, running on older versions of CentOS, SGE, and ROCKS, using all physical cores as well as all hyperthreads was never a problem (even on NUMA nodes).
>>>>>
>>>>> What is the recommended SGE parallel environment definition for OpenMPI 1.8.2?
>>>>
>>>> Whether you prefer $fill_up or $round_robin is up to you - do you want all your processes on as few machines as possible, or spread around the cluster? If there is much communication, it may be better to use fewer machines; but if each process does heavy I/O to the local scratch disk, spreading them around may be the preferred choice. This doesn't make any difference to Open MPI, as the generated $PE_HOSTFILE contains just the list of granted slots. Doing it $fill_up style will of course fill the first node, including the hyperthreading cores, before moving to the next machine (`man sge_pe`).
>>>>
>>>> -- Reuti
>>>>
>>>>> I apologize for the length of this, but I thought it best to provide more information than less.
>>>>>
>>>>> Thank you in advance,
>>>>>
>>>>> -Bill Lane
>>>>>
>>>>> ________________________________________
>>>>> From: users [users-boun...@open-mpi.org] on behalf of Jeff Squyres (jsquyres) [jsquy...@cisco.com]
>>>>> Sent: Friday, August 08, 2014 5:25 AM
>>>>> To: Open MPI User's List
>>>>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots
>>>>>
>>>>> On Aug 8, 2014, at 1:24 AM, Lane, William <william.l...@cshs.org> wrote:
>>>>>
>>>>>> Using the "--mca btl tcp,self" switch to mpirun solved all the issues (in addition to the requirement to include the --mca btl_tcp_if_include eth0 switch). I believe the "--mca btl tcp,self" switch limits inter-process communication within a node to using the TCP loopback rather than shared memory.
>>>>>
>>>>> Correct. You will not be using shared memory for MPI communication at all -- just TCP.
>>>>>
>>>>>> I should also point out that all of the nodes on this cluster feature NUMA architecture.
>>>>>>
>>>>>> Will using the "--mca btl tcp,self" switch to mpirun result in any degraded performance compared to using shared memory?
>>>>>
>>>>> Generally yes, but it depends on your application. If your application does very little MPI communication, then the difference between shared memory and TCP is likely negligible.
>>>>>
>>>>> I'd strongly suggest two things:
>>>>>
>>>>> - Upgrade to at least Open MPI 1.6.5 (1.8.x would be better, if possible)
>>>>> - Run your program through a memory-checking debugger such as Valgrind
>>>>>
>>>>> Seg faults like you initially described can be caused by errors in your MPI application itself -- the fact that using TCP only (and not shared memory) avoids the segvs does not mean that the issue is actually fixed; it may well mean that the error is still there, but is happening in a case that doesn't seem to cause enough damage to cause a segv.
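>>>>>
>>>>> For the Valgrind run, something like this is a common starting point (the process count, flags, and app name are placeholders, and expect a large slowdown):
>>>>>
>>>>>    mpirun -np 4 valgrind --leak-check=full --track-origins=yes ./your_mpi_app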
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/