I have some updates on these issues and some test results as well. We upgraded OpenMPI to the latest version, 1.8.2, but when submitting jobs via the SGE orte parallel environment we received errors whenever more slots are requested than there are actual cores on the first node allocated to the job.
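Roughly, the submission looks like the script below; this is only a sketch, and the slot count, script options, and HPL binary name are illustrative placeholders rather than our exact production script:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -pe orte 32   # placeholder slot count; failures start once this exceeds the cores on the first allocated node
# Under SGE tight integration mpirun picks up the allocation from the PE;
# -np $NSLOTS just makes the slot count explicit.
mpirun -np $NSLOTS --mca btl tcp,self --mca btl_tcp_if_include eth0 ./xhpl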
The btl tcp,self switch passed to mpirun made a significant difference in performance, as shown in the HPL numbers below.

Even with the oversubscribe option, the memory-mapping errors still persist. On 32-core nodes, with HPL compiled against openmpi/1.8.2, the run reliably starts failing once 20 cores are allocated. Note that I tested with 'btl tcp,self' defined and it slows a quick solve down by roughly a factor of two; the difference on a larger solve would probably be more dramatic:

- Quick HPL, 16 cores, with SM: ~19 GFlops
- Quick HPL, 16 cores, without SM: ~10 GFlops

Unfortunately, the recompiled HPL did not work, but it did give us more information (error below). I'm still trying a couple of things.

A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        csclprd3-0-7
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.

When using the SGE make parallel environment to submit jobs, everything worked perfectly. I noticed that with the make PE, the number of slots allocated from each node to the job corresponded to the number of physical CPUs (sockets) and disregarded any additional cores within a CPU and any hyperthreading cores.

Here are the definitions of the two parallel environments tested (orte always fails when more slots are requested than there are CPU cores on the first node allocated to the job by SGE):

[root@csclprd3 ~]# qconf -sp orte
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
qsort_args         NONE

[root@csclprd3 ~]# qconf -sp make
pe_name            make
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
qsort_args         NONE

Although everything seems to work with the make PE, I'd still like to know why, because on a much older version of OpenMPI, running on older versions of CentOS, SGE and ROCKS, using all physical cores as well as all hyperthreads was never a problem (even on NUMA nodes). What is the recommended SGE parallel environment definition for OpenMPI 1.8.2?

I apologize for the length of this, but I thought it better to provide more information rather than less.

Thank you in advance,

-Bill Lane

________________________________________
From: users [users-boun...@open-mpi.org] on behalf of Jeff Squyres (jsquyres) [jsquy...@cisco.com]
Sent: Friday, August 08, 2014 5:25 AM
To: Open MPI User's List
Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots

On Aug 8, 2014, at 1:24 AM, Lane, William <william.l...@cshs.org> wrote:

> Using the "--mca btl tcp,self" switch to mpirun solved all the issues (in addition to
> the requirement to include the --mca btl_tcp_if_include eth0 switch). I believe
> the "--mca btl tcp,self" switch limits inter-process communication within a
> node to using the TCP loopback rather than shared memory.

Correct. You will not be using shared memory for MPI communication at all -- just TCP.

> I should also point out that all of the nodes
> on this cluster feature NUMA architecture.
>
> Will using the "--mca btl tcp,self" switch to mpirun result in any degraded
> performance issues over using shared memory?

Generally yes, but it depends on your application.
If your application does very little MPI communication, then the difference between shared memory and TCP is likely negligible.

I'd strongly suggest two things:

- Upgrade to at least Open MPI 1.6.5 (1.8.x would be better, if possible)
- Run your program through a memory-checking debugger such as Valgrind

Seg faults like you initially described can be caused by errors in your MPI application itself -- the fact that using TCP only (and not shared memory) avoids the segvs does not mean that the issue is actually fixed; it may well mean that the error is still there, but is happening in a case that doesn't seem to cause enough damage to cause a segv.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
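As a concrete illustration of the Valgrind suggestion above, one way to run each MPI rank under memcheck looks roughly like this (the binary name and rank count are placeholders, and a small problem size is advisable since Valgrind slows execution considerably):

# Launch every rank of the job under Valgrind's memcheck tool.
# ./xhpl and -np 4 are placeholders, not the actual job configuration.
mpirun -np 4 --mca btl tcp,self valgrind --leak-check=full --track-origins=yes ./xhpl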