Here's a qrsh run of OpenMPI 1.8.7 that actually generated an error message; usually I get no output whatsoever (from stderr or stdout) from the job, and it eventually generates core dumps:
qrsh -V -now yes -pe orte 209 mpirun -np 209 -display-devel-map --prefix /hpc/apps/mpi/openmpi/1.8.7/ --mca btl ^sm --hetero-nodes --bind-to core /hpc/home/lanew/mpi/openmpi/ProcessColors3

--------------------------------------------------------------------------
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node:  csclprd3-4-2

This usually is due to not having the required NUMA support installed
on the node. In some Linux distributions, the required support is
contained in the libnumactl and libnumactl-devel packages.

This is a warning only; your job will continue, though performance may
be degraded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        csclprd3-4-2
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

I'm using CentOS 6.3 and Son of Grid Engine as my scheduling agent. The relevant NUMA libraries have been installed on the cluster:

csclprd3-4-2 ~]$ yum list installed *numa*
Installed Packages
numactl.x86_64          2.0.7-3.el6     @centos6.3-x86_64-0/$releasever
numactl-devel.x86_64

Here's the lstopo output for the node in question (an x3550-M3 node with 6-core Westmere CPUs and hyperthreading):

csclprd3-4-2 ~]$ lstopo
Machine (96GB)
  NUMANode L#0 (P#0 48GB) + Socket L#0 + L3 L#0 (12MB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#12)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#13)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#14)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#15)
    L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
      PU L#8 (P#4)
      PU L#9 (P#16)
    L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
      PU L#10 (P#5)
      PU L#11 (P#17)
  NUMANode L#1 (P#1 48GB) + Socket L#1 + L3 L#1 (12MB)
    L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
      PU L#12 (P#6)
      PU L#13 (P#18)
    L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
      PU L#14 (P#7)
      PU L#15 (P#19)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
      PU L#16 (P#8)
      PU L#17 (P#20)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
      PU L#18 (P#9)
      PU L#19 (P#21)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
      PU L#20 (P#10)
      PU L#21 (P#22)
    L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
      PU L#22 (P#11)
      PU L#23 (P#23)

I'm going to set up a PE that has the appropriate parameters for OpenMPI, as described here: https://www.open-mpi.org/faq/?category=sge, and re-test with this PE as well as with the --leave-session-attached --mca plm_base_verbose 5 mpirun switches. A sketch of the PE settings I have in mind is below.
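For reference, the PE definition I have in mind follows the example on that FAQ page. This is only a sketch, reconstructed from the FAQ rather than copied from a live cluster: the PE name "orte" matches what I already request with -pe orte, the slot count is a placeholder, and qsort_args is the Son of Grid Engine-specific setting discussed further down in this thread:

$ qconf -sp orte
pe_name            orte
slots              99999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
qsort_args         TRUE

control_slaves TRUE is what allows mpirun to start its daemons on the slave hosts via qrsh -inherit, and allocation_rule only affects how the granted slots are packed onto nodes ($fill_up vs. $round_robin).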
-Bill L.

________________________________
From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
Sent: Wednesday, August 05, 2015 4:41 PM
To: Open MPI Users
Subject: Re: [OMPI users] Son of Grid Engine, Parallel Environments and OpenMPI 1.8.7

Well that stinks! Let me know what's going on and I'll take a look. FWIW, the best method is generally to configure OMPI with --enable-debug, and then run with "--leave-session-attached --mca plm_base_verbose 5". That will tell us what the launcher thinks it is doing and what the daemons think is wrong.

On Wed, Aug 5, 2015 at 3:17 PM, Lane, William <william.l...@cshs.org> wrote:

Actually, we're still having problems submitting OpenMPI 1.8.7 jobs to the cluster thru SGE (which we need to do in order to track usage stats on the cluster). I suppose I'll make a PE with the appropriate settings and see if that makes a difference.

-Bill L

________________________________
From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
Sent: Wednesday, August 05, 2015 1:18 PM
To: Open MPI Users
Subject: Re: [OMPI users] Son of Grid Engine, Parallel Environments and OpenMPI 1.8.7

You know, I honestly don't know - there is a patch in there for qsort, but I haven't checked it against SGE. Let us know if you hit a problem and we'll try to figure it out.

Glad to hear your cluster is working - nice to have such challenges to shake the cobwebs out :-)

On Wed, Aug 5, 2015 at 12:43 PM, Lane, William <william.l...@cshs.org> wrote:

I read at https://www.open-mpi.org/faq/?category=sge that for OpenMPI Parallel Environments there's a special consideration for Son of Grid Engine:

'"qsort_args" is necessary with the Son of Grid Engine distribution, version 8.1.1 and later, and probably only applicable to it. For very old versions of SGE, omit "accounting_summary" too.'

Does this requirement still hold true for OpenMPI 1.8.7? Because the webpage above only refers to much older versions of OpenMPI.

I also want to thank Ralph for all his help in debugging the manifold problems with our mixed-bag cluster.

-Bill Lane
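For completeness, here is a sketch of what the re-test could look like, combining the qrsh invocation at the top of this thread with the debug switches Ralph suggests above. The configure line assumes Open MPI gets rebuilt with --enable-debug into the same prefix, and the "overload-allowed" qualifier is only the override mentioned in the bind-overload warning (an experiment, not a recommended production setting):

# rebuild Open MPI with debugging enabled (same install prefix as the run above)
./configure --prefix=/hpc/apps/mpi/openmpi/1.8.7 --enable-debug
make all install

# re-run under SGE with verbose launcher/daemon output
qrsh -V -now yes -pe orte 209 mpirun -np 209 -display-devel-map \
    --prefix /hpc/apps/mpi/openmpi/1.8.7/ --mca btl ^sm --hetero-nodes \
    --bind-to core:overload-allowed --report-bindings \
    --leave-session-attached --mca plm_base_verbose 5 \
    /hpc/home/lanew/mpi/openmpi/ProcessColors3

The plm_base_verbose output should show whether mpirun is launching its daemons through qrsh -inherit (i.e., whether the SGE integration is actually being used), and --report-bindings confirms where each rank ends up.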