Here's a qrsh run of OpenMPI 1.8.7 that actually generated an error message;
usually I get no output whatsoever (from either stderr or stdout) from the
job, and it eventually generates core dumps:

qrsh -V -now yes -pe orte 209 mpirun -np 209 -display-devel-map \
    --prefix /hpc/apps/mpi/openmpi/1.8.7/ --mca btl ^sm --hetero-nodes \
    --bind-to core /hpc/home/lanew/mpi/openmpi/ProcessColors3
--------------------------------------------------------------------------
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node:  csclprd3-4-2

This usually is due to not having the required NUMA support installed
on the node. In some Linux distributions, the required support is
contained in the libnumactl and libnumactl-devel packages.
This is a warning only; your job will continue, though performance may be 
degraded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        csclprd3-4-2
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
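
For reference, if I just wanted to override that protection, I believe the
binding directive takes an "overload-allowed" qualifier in the 1.8 series,
i.e. something along these lines (untested on my end):

qrsh -V -now yes -pe orte 209 mpirun -np 209 -display-devel-map \
    --prefix /hpc/apps/mpi/openmpi/1.8.7/ --mca btl ^sm --hetero-nodes \
    --bind-to core:overload-allowed \
    /hpc/home/lanew/mpi/openmpi/ProcessColors3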

I'm using CentOS 6.3 and Son of Grid Engine as my scheduling agent.

The relevant NUMA libraries have been installed to the cluster:

csclprd3-4-2 ~]$ yum list installed *numa*
Installed Packages
numactl.x86_64          2.0.7-3.el6    @centos6.3-x86_64-0/$releasever
numactl-devel.x86_64
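
To confirm the exact devel package version (the yum output wrapped badly in
the terminal), I can also query rpm directly:

csclprd3-4-2 ~]$ rpm -q numactl numactl-devel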

Here's the lstopo output for the node in question (an x3550-M3 node w/6-core
Westmere CPUs and hyperthreading):
csclprd3-4-2 ~]$ lstopo
Machine (96GB)
  NUMANode L#0 (P#0 48GB) + Socket L#0 + L3 L#0 (12MB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#12)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#13)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#14)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#15)
    L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
      PU L#8 (P#4)
      PU L#9 (P#16)
    L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
      PU L#10 (P#5)
      PU L#11 (P#17)
  NUMANode L#1 (P#1 48GB) + Socket L#1 + L3 L#1 (12MB)
    L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
      PU L#12 (P#6)
      PU L#13 (P#18)
    L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
      PU L#14 (P#7)
      PU L#15 (P#19)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
      PU L#16 (P#8)
      PU L#17 (P#20)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
      PU L#18 (P#9)
      PU L#19 (P#21)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
      PU L#20 (P#10)
      PU L#21 (P#22)
    L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
      PU L#22 (P#11)
      PU L#23 (P#23)
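
If I'm reading that right, that's 2 sockets x 6 cores with hyperthreading,
i.e. 12 physical cores and 24 PUs on the node. A quick way to double-check
the counts (assuming the hwloc build on the node supports the --only filter):

csclprd3-4-2 ~]$ lstopo --only core | wc -l
csclprd3-4-2 ~]$ lstopo --only PU | wc -l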

I'm going to set up a PE that has the appropriate parameters for OpenMPI,
as described here: https://www.open-mpi.org/faq/?category=sge
and re-test w/this PE as well as using the --leave-session-attached
--mca plm_base_verbose 5 mpirun switches.
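
The PE I have in mind is essentially the one from that FAQ entry; roughly
along these lines (slot count and allocation rule still to be adjusted for
our cluster, so treat this as a sketch):

$ qconf -sp orte
pe_name            orte
slots              4096
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
qsort_args         NONE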

-Bill L.

________________________________
From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
[r...@open-mpi.org]
Sent: Wednesday, August 05, 2015 4:41 PM
To: Open MPI Users
Subject: Re: [OMPI users] Son of Grid Engine, Parallel Environments and OpenMPI 
1.8.7

Well that stinks! Let me know what's going on and I'll take a look. FWIW, the 
best method is generally to configure OMPI with --enable-debug, and then run 
with "--leave-session-attached --mca plm_base_verbose 5". That will tell us 
what the launcher thinks it is doing and what the daemons think is wrong.
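
In other words, something like this (the prefix and app name are just
placeholders):

./configure --prefix=/opt/openmpi-1.8.7-debug --enable-debug
make all install

mpirun --leave-session-attached --mca plm_base_verbose 5 -np <N> ./your_app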


On Wed, Aug 5, 2015 at 3:17 PM, Lane, William 
<william.l...@cshs.org<mailto:william.l...@cshs.org>> wrote:
Actually, we're still having problems submitting OpenMPI 1.8.7 jobs
to the cluster thru SGE (which we need to do in order to track usage
stats on the cluster). I suppose I'll make a PE w/the appropriate settings
and see if that makes a difference.
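
For reference, the SGE submission we've been trying boils down to a job
script roughly like this (trimmed down):

#$ -pe orte 209
#$ -cwd
mpirun -np $NSLOTS /hpc/home/lanew/mpi/openmpi/ProcessColors3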

-Bill L

________________________________
From: users [users-boun...@open-mpi.org<mailto:users-boun...@open-mpi.org>] on 
behalf of Ralph Castain [r...@open-mpi.org<mailto:r...@open-mpi.org>]
Sent: Wednesday, August 05, 2015 1:18 PM
To: Open MPI Users
Subject: Re: [OMPI users] Son of Grid Engine, Parallel Environments and OpenMPI 
1.8.7

You know, I honestly don't know - there is a patch in there for qsort, but I 
haven't checked it against SGE. Let us know if you hit a problem and we'll try 
to figure it out.

Glad to hear your cluster is working - nice to have such challenges to shake 
the cobwebs out :-)

On Wed, Aug 5, 2015 at 12:43 PM, Lane, William 
<william.l...@cshs.org<mailto:william.l...@cshs.org>> wrote:
I read @

https://www.open-mpi.org/faq/?category=sge

that for OpenMPI Parallel Environments there's
a special consideration for Son of Grid Engine:

   '"qsort_args" is necessary with the Son of Grid Engine distribution,
   version 8.1.1 and later, and probably only applicable to it.  For
   very old versions of SGE, omit "accounting_summary" too.'

Does this requirement still hold true for OpenMPI 1.8.7? Because
the webpage above only refers to much older versions of OpenMPI.
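
If it does, I'm assuming it's just a matter of editing the PE with
"qconf -mp orte" and making sure it contains something like:

accounting_summary TRUE
qsort_args         NONE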

I also want to thank Ralph for all his help in debugging the manifold
problems w/our mixed bag cluster.

-Bill Lane







