The "--bind-to none" switch didn't help; I'm still getting the same errors. The only NUMA package installed on the nodes in this CentOS 6.2 cluster is the following:
numactl-2.0.7-3.el6.x86_64

This package is described as:

numactl.x86_64 : Library for tuning for Non Uniform Memory Access machines

Since many of these systems are NUMA systems (with separate memory address spaces for the sockets), could it be that the correct NUMA libraries aren't installed? Here are some of the other NUMA packages available for CentOS 6.x:

yum search numa | less
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
============================== N/S Matched: numa ===============================
numactl-devel.i686 : Development package for building Applications that use numa
numactl-devel.x86_64 : Development package for building Applications that use numa
numad.x86_64 : NUMA user daemon
numactl.i686 : Library for tuning for Non Uniform Memory Access machines
numactl.x86_64 : Library for tuning for Non Uniform Memory Access machines

-Bill Lane

________________________________________
From: users [users-boun...@open-mpi.org] on behalf of Reuti [re...@staff.uni-marburg.de]
Sent: Thursday, August 28, 2014 3:27 AM
To: Open MPI Users
Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)

On 28.08.2014 at 10:09, Lane, William wrote:

> I have some updates on these issues and some test results as well.
>
> We upgraded Open MPI to the latest version, 1.8.2, but when submitting jobs via
> the SGE orte parallel environment we received errors whenever more slots are
> requested than there are actual cores on the first node allocated to the job.

Does "-bind-to none" help? Binding is switched on by default from Open MPI 1.8 onwards.

> The btl tcp,self switch passed to mpirun made significant differences in
> performance, as per the below:
>
> Even with the oversubscribe option, the memory mapping errors still persist.
> On 32-core nodes and with HPL compiled for openmpi/1.8.2, it reliably
> starts failing at 20 cores allocated. Note that I tested with 'btl tcp,self'
> defined and it slows the solve down by a factor of 2 on a quick solve. The
> results on a larger solve would probably be more dramatic:
> - Quick HPL, 16 cores with SM: ~19 GFlops
> - Quick HPL, 16 cores without SM: ~10 GFlops
>
> Unfortunately, a recompiled HPL did not work, but it did give us more
> information (error below). Still trying a couple of things.
>
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
>    Bind to:     CORE
>    Node:        csclprd3-0-7
>    #processes:  2
>    #cpus:       1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
>
> When using the SGE make parallel environment to submit jobs, everything worked
> perfectly. I noticed that with the make PE the number of slots allocated from
> each node to the job corresponded to the number of CPUs and disregarded any
> additional cores within a CPU and any hyperthreading cores.

For SGE the hyperthreading cores count as normal cores. In principle it's possible to have an RQS defined in SGE (`qconf -srqsl`) which will limit the number of cores for the "make" PE, or (better) to limit it in each exechost definition to the physically installed ones (this is what I usually set up - maybe leaving hyperthreading switched on gives some room for the kernel processes this way).
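As a sketch of the exechost variant (untested; the host name below is just one of yours as an example, and the slot count has to match the physical core count of that particular node):

    qconf -mattr exechost complex_values slots=16 csclprd3-0-7

and an RQS that caps only the "make" PE could look roughly like this (again only a sketch - adjust the per-host limit to the real core counts):

    {
       name         make_physical_cores
       description  cap the make PE at the physical cores per host
       enabled      TRUE
       limit        hosts {*} pes make to slots=16
    }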
> Here are the definitions of the two parallel environments tested (with orte
> always failing when more slots are requested than there are CPU cores on the
> first node allocated to the job by SGE):
>
> [root@csclprd3 ~]# qconf -sp orte
> pe_name            orte
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary TRUE
> qsort_args         NONE
>
> [root@csclprd3 ~]# qconf -sp make
> pe_name            make
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    NONE
> stop_proc_args     NONE
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary TRUE
> qsort_args         NONE
>
> Although everything seems to work with the make PE, I'd still like to know
> why, because on a much older version of Open MPI loaded on an older version
> of CentOS, SGE and ROCKS, using all physical cores as well as all
> hyperthreads was never a problem (even on NUMA nodes).
>
> What is the recommended SGE parallel environment definition for
> Open MPI 1.8.2?

Whether you prefer $fill_up or $round_robin is up to you - do you prefer all your processes on as few machines as possible, or spread around the cluster? If there is much communication, fewer machines may be better; but if each process does heavy I/O to the local scratch disk, spreading them around may be the preferred choice. It makes no difference to Open MPI, as the generated $PE_HOSTFILE contains just the list of granted slots. Doing it $fill_up style will of course fill the first node, including the hyperthreading cores, before moving to the next machine (`man sge_pe`).

-- Reuti

> I apologize for the length of this, but I thought it best to provide more
> information than less.
>
> Thank you in advance,
>
> -Bill Lane
>
> ________________________________________
> From: users [users-boun...@open-mpi.org] on behalf of Jeff Squyres (jsquyres)
> [jsquy...@cisco.com]
> Sent: Friday, August 08, 2014 5:25 AM
> To: Open MPI User's List
> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots
>
> On Aug 8, 2014, at 1:24 AM, Lane, William <william.l...@cshs.org> wrote:
>
>> Using the "--mca btl tcp,self" switch to mpirun solved all the issues (in
>> addition to the requirement to include the --mca btl_tcp_if_include eth0
>> switch). I believe the "--mca btl tcp,self" switch limits inter-process
>> communication within a node to using the TCP loopback rather than shared
>> memory.
>
> Correct. You will not be using shared memory for MPI communication at all --
> just TCP.
>
>> I should also point out that all of the nodes on this cluster feature a NUMA
>> architecture.
>>
>> Will using the "--mca btl tcp,self" switch to mpirun result in any degraded
>> performance over using shared memory?
>
> Generally yes, but it depends on your application. If your application does
> very little MPI communication, then the difference between shared memory and
> TCP is likely negligible.
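(A quick way to see that gap for yourself is to run the same small case on one node once with and once without shared memory - only a sketch, the rank count and the ./xhpl binary are placeholders for your own setup:

    mpirun -np 16 --mca btl sm,self ./xhpl     # shared memory + self, within one node
    mpirun -np 16 --mca btl tcp,self ./xhpl    # TCP loopback only

The ~19 vs. ~10 GFlops figures quoted above are roughly the kind of difference to expect for a communication-heavy solve.)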
>
> I'd strongly suggest two things:
>
> - Upgrade to at least Open MPI 1.6.5 (1.8.x would be better, if possible)
> - Run your program through a memory-checking debugger such as Valgrind
>
> Seg faults like you initially described can be caused by errors in your MPI
> application itself -- the fact that using TCP only (and not shared memory)
> avoids the segvs does not mean that the issue is actually fixed; it may well
> mean that the error is still there, but is happening in a case that doesn't
> seem to cause enough damage to cause a segv.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
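For reference, the Valgrind run suggested above usually just wraps the application binary, along these lines (a sketch only - the rank count and ./xhpl are placeholders, and $OMPI_PREFIX stands for wherever Open MPI 1.8.2 is installed; most installs ship an openmpi-valgrind.supp there, if present, to silence known false positives):

    # placeholders: 4 ranks, binding left off while debugging, ./xhpl = your binary,
    # $OMPI_PREFIX = your Open MPI installation prefix
    mpirun -np 4 --bind-to none \
        valgrind --leak-check=full --track-origins=yes \
                 --suppressions=$OMPI_PREFIX/share/openmpi/openmpi-valgrind.supp ./xhpl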