Hmmm…puzzling. Unfortunately, the developers don’t have access to SGE machines, so I can only shoot in the dark here. Still, we can see what’s going on if you have patience with us.
Couple of things we can try so we get more info (an example command line with these changes folded in is appended at the end of this message):

* There is no MCA param pls_gridengine_verbose in the 1.8 series. Please set "-mca plm_base_verbose 5" on the cmd line instead.

* I take it you configured 1.8.7 with --enable-debug, yes? If not, please do so.

* Add "--leave-session-attached" to the cmd line.

* You state that this cluster has hetero nodes, but you didn't set "--hetero-nodes" on the cmd line. You probably need to do so.

* We set the following qrsh options "under the covers" when we launch via qrsh:

      opal_argv_append_nosize(&rsh_agent_argv, "-inherit");
      /* Don't use the "-noshell" flag as qrsh would have a problem
       * swallowing a long command */
      opal_argv_append_nosize(&rsh_agent_argv, "-nostdin");
      opal_argv_append_nosize(&rsh_agent_argv, "-V");

  In other words, mpirun will launch the orteds using a cmd line like "qrsh -inherit -nostdin -V orted". Does your installation have an issue with any of those?

If you get a chance to run with the above changes, please send along the output and I'll see what can be done.

Ralph

> On Jul 30, 2015, at 2:19 PM, Lane, William <william.l...@cshs.org> wrote:
>
> I'm running a mixed cluster of blades (HS21 and HS22 chassis), x3550-M3 and
> x3550-M4 systems, some of which support hyperthreading while others don't
> (specifically the HS21 blades), all on CentOS 6.3 w/SGE.
>
> I have no problems running my simple OpenMPI 1.8.7 test code outside of SGE
> (with or without the --bind-to core switch), but can only run the jobs
> within SGE via qrsh on a limited number of slots (4 at most) successfully.
> The errors are very similar to the ones I was getting running OpenMPI
> 1.8.5 - 1.8.6 outside of SGE on this same cluster.
>
> Strangely, when running the test code outside of SGE w/the --bind-to core
> switch, mpirun still binds processes to hyperthreading cores. Additionally,
> the --bind-to core switch prevents the OpenMPI 1.8.7 test code from running
> at all within SGE (it outputs warnings about missing NUMA libraries
> reducing performance, then exits).
>
> We would rather run our OpenMPI jobs from within SGE so that we can get
> accounting data on OpenMPI jobs for administrative purposes.
>
> The orte PE I've been using seems to meet all the requirements for previous
> versions of OpenMPI:
>
> The allocation rule is fill-up, rather than round-robin (I'm not sure if
> this makes a difference or not).
>
> The value NONE in user_lists and xuser_lists means enable everybody and
> exclude nobody.
>
> The value of control_slaves must be TRUE; otherwise, qrsh exits with an
> error message.
>
> The value of job_is_first_task must be FALSE or the job launcher consumes a
> slot. In other words, mpirun itself will count as one of the slots and the
> job will fail, because only n-1 processes will start.
>
> And be sure the queue will make use of the PE that you specified.
>
> Below is the command line I've been using to generate the errors found in
> the attached file out.txt:
>
> qrsh -V -now yes -pe orte 186 mpirun -np 186 --prefix
> /hpc/apps/mpi/openmpi/1.8.7/ --mca btl_tcp_if_include eth0 --mca
> pls_gridengine_verbose 1 /hpc/home/lanew/mpi/openmpi/ProcessColors3 >>
> out.txt 2>&1
>
> Sorry for the length. Thanks in advance for any help in resolving this
> nagging issue (I wish we had a homogeneous cluster now).
>
> -Bill L.
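Example command line with the above changes folded in. This is only a sketch built from the command you posted (same PE name, slot count, prefix, interface, and binary path); since I can't test against SGE here, treat it as a starting point rather than something verified:

    qrsh -V -now yes -pe orte 186 mpirun -np 186 \
        --prefix /hpc/apps/mpi/openmpi/1.8.7/ \
        --mca btl_tcp_if_include eth0 \
        --mca plm_base_verbose 5 \
        --leave-session-attached --hetero-nodes \
        /hpc/home/lanew/mpi/openmpi/ProcessColors3 >> out.txt 2>&1

Note that "--mca pls_gridengine_verbose 1" has been dropped since that param doesn't exist in the 1.8 series. If your 1.8.7 build wasn't configured with --enable-debug, rebuilding with something like "./configure --prefix=/hpc/apps/mpi/openmpi/1.8.7 --enable-debug" (plus whatever other configure options you normally use) will make the verbose output much more useful.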