On Sat, Oct 11, 2008 at 5:34 PM, Pak Lui <p...@penguincomputing.com> wrote:
> It looks like, from your earlier discussion on the gridengine users
> alias, that you are able to run a simple single-queue, tightly
> integrated SGE parallel job with Open MPI; it's just a matter of using
> multiple queues with your parallel job, right?
>
> http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=26298
>
> The tm messages are just a red herring. What's more interesting is the
> verbose messages from qrsh (the lines that you enable by using -mca
> pls_gridengine_verbose 1, i.e. the lines that do not start with the
> prefix prepended by OMPI, like [shakespeare:05720]).
>
>>> Starting server daemon at host "shakespeare.nci.nih.gov"
>>> Starting server daemon at host "octopus.nci.nih.gov"
>>> Server daemon successfully started with task id "1.shakespeare"
>>> [shakespeare:05733] mca: base: component_find: unable to open ras tm: file not found (ignored)
>>> [shakespeare:05733] mca: base: component_find: unable to open pls tm: file not found (ignored)
>>> error: executing task of job 3576 failed: failed sending task to ex...@octopus.nci.nih.gov: can't find connection
>
> Since you see these verbose messages here, it means that you are using
> "qrsh -inherit" in the backend for launching tasks. (You can also see
> the actual "qrsh -inherit" line by setting "-mca pls_gridengine_debug 1"
> in mpirun.)
>
> The messages above show that when mpirun tries to send the SGE tasks
> to the nodes shakespeare and octopus via 2 queues, shakespeare appears
> to start the server daemon successfully, but you don't seem to get the
> same message from octopus. Typically I see only 1 message from the
> server daemon when I use only 1 queue in my parallel job.
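A minimal sketch of the log reading described above: separate the qrsh lines from the OMPI-prefixed ones, then list the hosts that got a "Starting server daemon" line but no "successfully started" line. The function names are made up for illustration, and the matching assumes the task id has the form "<n>.<short hostname>", as in the one example in the log; that format is an assumption, not something taken from SGE documentation.

```shell
# Print only the lines that do NOT carry Open MPI's "[host:pid]" prefix,
# i.e. the lines that come from qrsh/SGE itself.
# Reads the files given as arguments, or stdin if there are none.
qrsh_lines() {
    grep -v -E '^\[[^]:]+:[0-9]+\]' "$@"
}

# List hosts that have a "Starting server daemon" line but no matching
# "Server daemon successfully started" line.
# ASSUMPTION: the task id looks like "<n>.<short hostname>".
unstarted_hosts() {
    awk -F'"' '
        /^Starting server daemon at host/ {
            # Key on the short host name, remember the full name.
            split($2, parts, "[.]")
            wanted[parts[1]] = $2
        }
        /^Server daemon successfully started with task id/ {
            # Strip the leading task number to get the short host name.
            sub(/^[0-9]+[.]/, "", $2)
            started[$2] = 1
        }
        END {
            for (h in wanted)
                if (!(h in started)) print wanted[h]
        }
    ' "$@"
}
```

With the error file from this thread, `qrsh_lines junksub.sh.e3574 | unstarted_hosts` would be one way to chain the two.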
>
> In order for the head node's "qrsh -inherit" tasks to be accepted by the
> SGE daemons on the execution nodes, the execution daemons need to be
> allocated/notified ahead of time that tasks are about to arrive on
> those nodes.
>
> Anyway, I don't know why it needs to start the server daemon on octopus
> when you have 2 queues in your parallel job. But let's say it's the
> right behavior; SGE seems to have a problem starting the task from the
> head node shakespeare on octopus (hence the "failed sending task to
> execd: can't find connection" message). Did you already try
> connecting from shakespeare to octopus? You might also want to check
> the messages in octopus' log file
> $SGE_ROOT/default/spool/octopus/messages to see exactly why it isn't
> accepting the task.
>
> It may also be worthwhile to ask the gridengine folks whether anyone has
> tried a parallel job on multiple queues. I am not sure how commonly
> people use this SGE feature.
>
> I don't have access to an SGE cluster, but I notice from an online manual
> that there's a new qsub option (-masterq) in SGE 6.2 that may work. You
> might want to give it a try. This looks more and more like an SGE issue:
> it is not able to accept tasks from multiple queues for a parallel job.
>
> btw, you don't need the --with-sge switch in OMPI configure. It's new in
> OMPI v1.3 so that we don't build SGE support by default.
>
> My $.02...
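For trying the -masterq suggestion, a small dry-run helper along these lines may help check the submit line before sending it to the scheduler. It only prints the command. The helper name is made up, and the parallel environment name used in the example ("orte") is hypothetical; substitute whatever PE is configured on the cluster.

```shell
# Print (do not run) a qsub command line that requests a parallel
# environment and pins the master task to one queue instance.
# Arguments: <pe name> <slots> <master queue> <job script>
# HYPOTHETICAL helper: nothing here talks to SGE, it only formats text.
build_qsub_cmd() {
    printf 'qsub -pe %s %s -masterq %s %s\n' "$1" "$2" "$3" "$4"
}

# Example (PE name "orte" is an assumption, not taken from the thread):
#   build_qsub_cmd orte 2 all.q@shakespeare junksub.sh
```

Once the printed line looks right, it can be pasted into the shell or run through `eval`.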
Thanks, Pak. There is only one queue on the SGE system. Of course, there
are queue instances for each machine, which is usual for SGE. I'll give
-masterq a look. And the messages files for the involved machines are
devoid of anything useful; in fact, there is no mention of these jobs at
all.

Sean

>> Date: Sat, 11 Oct 2008 07:56:02 -0400
>> From: Jeff Squyres <jsquy...@cisco.com>
>> Subject: Re: [OMPI users] SGE tight integration and "tm" protocol for start
>> To: Open MPI Users <us...@open-mpi.org>
>> Message-ID: <3e62159b-14b9-4d44-96f6-0345079bc...@cisco.com>
>> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>>
>> I don't know much/anything about SGE (I'll leave that to the Sun folks on
>> this list to reply), but I can tell you about the tm plugins: tm is the
>> protocol used by the PBS/Torque family of launchers. It looks like your
>> Open MPI was built with TM support, but when you launch, it's likely unable
>> to find the support libraries that it needs to load those plugins.
>>
>> This is probably fine in your case, since you want to use SGE, not TM.
>>
>>
>> On Oct 9, 2008, at 4:40 PM, Sean Davis wrote:
>>
>>> I am relatively new to OpenMPI and Sun Grid Engine parallel
>>> integration. I have a small cluster that is running SGE 6.2 on Linux
>>> machines all using Intel Xeon processors. I have installed OpenMPI
>>> 1.2.7 from source using the --with-sge switch. Now, I am trying to
>>> troubleshoot some problems I am having.
>>> I have created a simple job script:
>>>
>>> #!/bin/bash
>>> #$ -S /bin/bash
>>> #$ -cwd
>>> mpirun --mca pls_gridengine_verbose 1 -np $NSLOTS hostname
>>>
>>> And the output on the error stream:
>>>
>>>> more junksub.sh.e3574
>>>
>>> [shakespeare:05720] mca: base: component_find: unable to open ras tm: file not found (ignored)
>>> [shakespeare:05720] mca: base: component_find: unable to open pls tm: file not found (ignored)
>>> Starting server daemon at host "shakespeare.nci.nih.gov"
>>> Starting server daemon at host "octopus.nci.nih.gov"
>>> Server daemon successfully started with task id "1.shakespeare"
>>> [shakespeare:05733] mca: base: component_find: unable to open ras tm: file not found (ignored)
>>> [shakespeare:05733] mca: base: component_find: unable to open pls tm: file not found (ignored)
>>> error: executing task of job 3576 failed: failed sending task to ex...@octopus.nci.nih.gov: can't find connection
>>> [shakespeare:05720] ERROR: A daemon on node octopus.nci.nih.gov failed to start as expected.
>>> [shakespeare:05720] ERROR: There may be more information available from
>>> [shakespeare:05720] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>>> [shakespeare:05720] ERROR: If the problem persists, please restart the
>>> [shakespeare:05720] ERROR: Grid Engine PE job
>>> [shakespeare:05720] ERROR: The daemon exited unexpectedly with status 1.
>>>
>>> However, there is no output in any output stream.
>>>
>>> And if I log into shakespeare and run qrsh -q all.q@octopus, I
>>> immediately get a slot, so there isn't a "direct" problem with
>>> connecting.
>>>
>>> As I got a hint from folks on the SGE mailing list, it appears that
>>> qrsh is not being used for job submission. Any suggestions as to why
>>> this might be the case (or if it is the case)?
>>>
>>> Thanks,
>>> Sean
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
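On the point about the messages files containing no mention of the jobs: a rough sketch of how that check could be repeated across hosts, using the spool path quoted earlier in the thread ($SGE_ROOT/default/spool/<host>/messages). The function name is made up, and it assumes the job id shows up as a standalone number in any relevant line, which may not hold for every SGE message.

```shell
# Count lines mentioning a job id in each host's execd messages file.
# Arguments: <sge root> <job id> <host> [<host> ...]
# ASSUMPTION: the job id appears as a standalone word in relevant lines.
job_mentions() {
    sge_root=$1
    job_id=$2
    shift 2
    for host in "$@"; do
        file="$sge_root/default/spool/$host/messages"
        if [ -r "$file" ]; then
            # grep -c exits non-zero on zero matches but still prints 0.
            count=$(grep -c -w -- "$job_id" "$file" || true)
            echo "$host: $count"
        else
            echo "$host: no readable messages file"
        fi
    done
}

# Example with the hosts and job id from this thread:
#   job_mentions "$SGE_ROOT" 3576 shakespeare octopus
```

A count of 0 on every host would match what was reported above, and would suggest the task request never reached the execution daemon's log at all.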