I am relatively new to OpenMPI and Sun Grid Engine parallel
integration.  I have a small cluster running SGE 6.2 on Linux
machines, all using Intel Xeon processors.  I installed OpenMPI
1.2.7 from source using the --with-sge switch, and I am now trying
to troubleshoot some problems.  I have created a simple job script:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
mpirun --mca pls_gridengine_verbose 1 -np $NSLOTS hostname

And the output on the error stream:
> more junksub.sh.e3574
[shakespeare:05720] mca: base: component_find: unable to open ras tm:
file not found (ignored)
[shakespeare:05720] mca: base: component_find: unable to open pls tm:
file not found (ignored)
Starting server daemon at host "shakespeare.nci.nih.gov"
Starting server daemon at host "octopus.nci.nih.gov"
Server daemon successfully started with task id "1.shakespeare"
[shakespeare:05733] mca: base: component_find: unable to open ras tm:
file not found (ignored)
[shakespeare:05733] mca: base: component_find: unable to open pls tm:
file not found (ignored)
error: executing task of job 3576 failed: failed sending task to
ex...@octopus.nci.nih.gov: can't find connection
[shakespeare:05720] ERROR: A daemon on node octopus.nci.nih.gov failed
to start as expected.
[shakespeare:05720] ERROR: There may be more information available from
[shakespeare:05720] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[shakespeare:05720] ERROR: If the problem persists, please restart the
[shakespeare:05720] ERROR: Grid Engine PE job
[shakespeare:05720] ERROR: The daemon exited unexpectedly with status 1.

However, nothing is written to the standard output stream.

And if I log into shakespeare and qrsh -q all.q@octopus, I immediately
get a slot, so there isn't a "direct" problem with connecting.

Based on a hint from folks on the SGE mailing list, it appears that
qrsh is not being used for job submission.  Any suggestions as to why
this might be the case (or whether it is the case at all)?
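
A minimal sketch of checks that might help narrow this down, assuming
ompi_info and qconf are on the PATH of the submit host.  The parallel
environment name "orte" is a placeholder assumption (the real PE name is
not shown above), and the run_if_present helper is purely illustrative.

```shell
#!/bin/bash
# Hypothetical diagnostic sketch; PE_NAME is a placeholder, not taken from
# the cluster described above. Override it with: PE_NAME=mype ./check.sh
PE_NAME="${PE_NAME:-orte}"

# Run a command only if its executable is on PATH; otherwise say so and
# carry on, so the script is safe to run on a host without SGE or Open MPI.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipped: $1 not found on PATH"
    fi
}

# 1. Does this Open MPI build list any gridengine components?
#    (The grep also lets the "skipped" message through.)
run_if_present ompi_info | grep -i -E 'gridengine|skipped'

# 2. Which parallel environments are defined in SGE?
run_if_present qconf -spl

# 3. How is the assumed PE configured?
run_if_present qconf -sp "$PE_NAME"
```

If the first check prints no gridengine lines, the build probably did not
pick up SGE support.  In the PE listing, a tightly integrated setup would
typically be expected to show "control_slaves TRUE", since that setting is
what allows tasks to be started on the other nodes of the job through
qrsh -inherit.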

Thanks,
Sean
