On Sat, Oct 11, 2008 at 5:34 PM, Pak Lui <p...@penguincomputing.com> wrote:
> From your earlier discussion on the gridengine users list, it looks like
> you are able to run a simple single-queue, tightly integrated SGE
> parallel job with Open MPI, and the problem only shows up when the
> parallel job uses multiple queues, right?
>
> http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=26298
>
> The tm messages are just a red herring. What's more interesting is the
> verbose output from qrsh (the lines you enable with -mca
> pls_gridengine_verbose 1; they are the ones without the prefix that OMPI
> prepends, such as [shakespeare:05720]).
>
>>> Starting server daemon at host "shakespeare.nci.nih.gov"
>>> Starting server daemon at host "octopus.nci.nih.gov"
>>> Server daemon successfully started with task id "1.shakespeare"
>>> [shakespeare:05733] mca: base: component_find: unable to open ras tm:
>>> file not found (ignored)
>>> [shakespeare:05733] mca: base: component_find: unable to open pls tm:
>>> file not found (ignored)
>>> error: executing task of job 3576 failed: failed sending task to
>>> ex...@octopus.nci.nih.gov: can't find connection
>
> Since you see these verbose messages here, it means that you are using
> "qrsh -inherit" in the backend for launching tasks. (You can also see
> the qrsh -inherit line by setting "-mca pls_gridengine_debug 1" in mpirun.)
>
> Those messages show that when mpirun tries to send the SGE tasks to
> shakespeare and octopus via 2 queues, shakespeare appears to start the
> server daemon successfully, but you don't get the same message from
> octopus. Typically I see only 1 message from the server daemon when I
> use only 1 queue in my parallel job.
>
> In order for the head node's "qrsh -inherit" tasks to be accepted by SGE
> daemons on execution nodes, the execution daemons need to be
> allocated/notified ahead of time that there are impending tasks coming
> to the nodes.
>
> Anyway, I don't know why it needs to start the server daemon on octopus
> when you have 2 queues in your parallel job. But let's say it's the
> right behavior: SGE seems to have a problem starting the task from the
> head node shakespeare on octopus (hence the "failed sending task to
> execd: can't find connection" message). Did you already try
> connecting from shakespeare to octopus? You might also want to check the
> messages in octopus' log file $SGE_ROOT/default/spool/octopus/messages
> to see exactly why it isn't accepting the task.
>
> It may also be worthwhile to ask the gridengine folks whether anyone has
> tried a parallel job on multiple queues. I am not sure how commonly
> people use this SGE feature.
>
> I don't have access to an SGE cluster, but I notice from an online manual
> that there's a new qsub option (-masterq) in SGE 6.2 that may work. You
> might want to give it a try. This looks more and more like an SGE issue
> with accepting tasks from multiple queues for a parallel job.
>
> btw, you don't need the --with-sge switch in OMPI configure for 1.2.x.
> It's new in OMPI v1.3, where SGE support is no longer built by default.
>
> My $.02...
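
To look at only the qrsh output described above, here is a minimal sketch
that drops every line carrying the "[host:pid]" prefix that Open MPI
prepends. The function name qrsh_lines is hypothetical (it is not part of
Open MPI or SGE), and the sketch assumes each message sits on a single line
in the real error file rather than being wrapped as it is in this mail.

```shell
#!/bin/bash
# Hypothetical helper, not part of Open MPI or SGE: print only the lines
# that do NOT start with an Open MPI "[host:pid]" prefix, i.e. the lines
# written by qrsh / SGE itself.
qrsh_lines() {
    grep -v -E '^\[[^]:]+:[0-9]+\]' "$1"
}

# Run only when a file name is given on the command line,
# for example the job's error file.
if [ "$#" -ge 1 ]; then
    qrsh_lines "$1"
fi
```

Saved as a script, this could be run against the error file of the job,
for example junksub.sh.e3574.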

Thanks, Pak.  There is only one queue on the SGE system.  Of course,
there are queue instances for each machine, as is usual for SGE.

I'll give -masterq a look.  The messages files for the machines
involved are devoid of anything useful; in fact, they do not mention
these jobs at all.

Sean
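
To check which queue instances a parallel job was actually granted, a
minimal sketch that could go in the job script ahead of mpirun is shown
below. It assumes that SGE sets PE_HOSTFILE inside a parallel environment
job and that each line of that file holds a host name, a slot count, a
queue instance and a processor range; the function name granted_slots is
hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print "host slots queue-instance" for each entry.
# Assumed PE hostfile layout per line:
#   host  slot-count  queue-instance  processor-range
granted_slots() {
    awk 'NF >= 3 { print $1, $2, $3 }' "$1"
}

# Inside a parallel job, write the summary to stderr so it lands next to
# the mpirun diagnostics. Outside a job PE_HOSTFILE is unset and nothing
# is printed.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    granted_slots "$PE_HOSTFILE" >&2
fi
```

If both shakespeare and octopus show up with a queue instance of the same
queue, the job is not really spanning multiple queues.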

>> Date: Sat, 11 Oct 2008 07:56:02 -0400
>> From: Jeff Squyres <jsquy...@cisco.com>
>> Subject: Re: [OMPI users] SGE tight integration and ?tm? protocol for
>>        start
>> To: Open MPI Users <us...@open-mpi.org>
>> Message-ID: <3e62159b-14b9-4d44-96f6-0345079bc...@cisco.com>
>> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>>
>> I don't know much/anything about SGE (I'll leave that to the Sun folks on
>> this list to reply), but I can tell you about the tm plugins: tm is the
>> protocol used by the PBS/Torque family of launchers.  It looks like your
>> Open MPI was built with TM support, but when you launch, it's likely unable
>> to find the support libraries that it needs to load those plugins.
>>
>> This is probably fine in your case, since you want to use SGE, not TM.
>>
>>
>> On Oct 9, 2008, at 4:40 PM, Sean Davis wrote:
>>
>>> I am relatively new to OpenMPI and Sun Grid Engine parallel
>>> integration.  I have a small cluster that is running SGE6.2 on linux
>>> machines all using Intel Xeon processors.  I have installed OpenMPI
>>> 1.2.7 from source using the --with-sge switch.  Now, I am trying to
>>> troubleshoot some problems I am having.  I have created a simple job
>>> script:
>>>
>>> #!/bin/bash
>>> #$ -S /bin/bash
>>> #$ -cwd
>>> mpirun --mca pls_gridengine_verbose 1 -np $NSLOTS hostname
>>>
>>> And the output on the error stream:
>>>>
>>>> more junksub.sh.e3574
>>>
>>> [shakespeare:05720] mca: base: component_find: unable to open ras tm:
>>> file not found (ignored)
>>> [shakespeare:05720] mca: base: component_find: unable to open pls tm:
>>> file not found (ignored)
>>> Starting server daemon at host "shakespeare.nci.nih.gov"
>>> Starting server daemon at host "octopus.nci.nih.gov"
>>> Server daemon successfully started with task id "1.shakespeare"
>>> [shakespeare:05733] mca: base: component_find: unable to open ras tm:
>>> file not found (ignored)
>>> [shakespeare:05733] mca: base: component_find: unable to open pls tm:
>>> file not found (ignored)
>>> error: executing task of job 3576 failed: failed sending task to
>>> ex...@octopus.nci.nih.gov: can't find connection
>>> [shakespeare:05720] ERROR: A daemon on node octopus.nci.nih.gov failed
>>> to start as expected.
>>> [shakespeare:05720] ERROR: There may be more information available from
>>> [shakespeare:05720] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>>> [shakespeare:05720] ERROR: If the problem persists, please restart the
>>> [shakespeare:05720] ERROR: Grid Engine PE job
>>> [shakespeare:05720] ERROR: The daemon exited unexpectedly with status 1.
>>>
>>> However, there is no output in any output stream.
>>>
>>> And if I log into shakespeare and qrsh -q all.q@octopus, I immediately
>>> get a slot, so there isn't a "direct" problem with connecting.
>>>
>>> As I got a hint from folks on the SGE mailing list, it appears that
>>> qrsh is not being used for job submission.  Any suggestions as to why
>>> this might be the case (or if it is the case)?
>>>
>>> Thanks,
>>> Sean
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>
>
