Hi,

On 08.02.2013 at 19:36, Pierre LINDENBAUM wrote:

> ( cross-posted on SO: http://stackoverflow.com/questions/14775451 )
> I'm very new to Open MPI and I'm trying to submit an Open MPI job to SGE:
> 
> 
> I've installed Open MPI, not in
>  /usr/...
> but in
>   /commun/data/packages/openmpi/
> 
> it was compiled with --with-sge.
> 
> I've added a new PE in SGE with qconf as described in
> http://docs.oracle.com/cd/E19080-01/n1.grid.eng6/817-5677/6ml49n2c0/index.html
> 
>  # /commun/data/packages/openmpi/bin/ompi_info | grep gridengine
>  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.3)
> 
>  # qconf -sq all.q | grep pe_
>  pe_list               make orte
> 
> Without SGE, the program runs without any problem, using several processors.
> 
>       /commun/data/packages/openmpi/bin/orterun -np 20 ./a.out args
> 
> Now I want to submit my program to SGE
> 
> In the Open MPI FAQ, I read:
> 
>  # Allocate a SGE interactive job with 4 slots
>  # from a parallel environment (PE) named 'orte'
>  shell$ qsh -pe orte 4
> 
> but my output is:
> 
>   qsh -pe orte 4
>   Your job 84550 ("INTERACTIVE") has been submitted
>   waiting for interactive job to be scheduled ...
>   Could not start interactive job.

An INTERACTIVE job behaves like an immediate job, i.e. as if submitted with
"-now y": it is either scheduled right away or not at all. Do you have an
interactive queue configured, and is the cluster empty right now?
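
A minimal sketch of how this could be checked, assuming the queue in question is all.q as in your output. The helper name has_interactive_qtype is made up for illustration; it only parses the text that `qconf -sq` prints and looks for INTERACTIVE on the qtype line.

```shell
#!/bin/sh
# Hypothetical helper: reads the output of `qconf -sq <queue>` on stdin and
# succeeds (exit 0) if the qtype line lists INTERACTIVE, fails otherwise.
has_interactive_qtype() {
    awk '$1 == "qtype" {
             for (i = 2; i <= NF; i++)
                 if ($i == "INTERACTIVE") found = 1
         }
         END { exit !found }'
}

# Usage on a submit host (not executed here, needs a working SGE setup):
#   qconf -sq all.q | has_interactive_qtype && echo "all.q accepts interactive jobs"
#   qstat -g c               # per-queue summary, including available slots
#   qsh -now n -pe orte 4    # stay pending until slots are free instead of giving up
```

With "-now n" the job should wait in the pending list like a batch job, which helps to tell "no free slots at this moment" apart from "no interactive queue at all".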


> I've also tried the mpirun command embedded in a script:
> 
>   $ cat ompi.sh
>   #!/bin/sh
>   /commun/data/packages/openmpi/bin/mpirun  \
>         /path/to/a.out args
> 
> but it fails
> 
>  $ cat ompi.sh.e84552
>  error: executing task of job 84552 failed: execution daemon on host "node02" didn't accept task

This is a good sign, as it already tries to use `qrsh -inherit ...`, i.e. Open
MPI has detected SGE and attempts to start its daemons through it. Can you
confirm the following settings:

$ qconf -sp orte
...
control_slaves     TRUE

$ qconf -sq all.q
...
shell_start_mode      unix_behavior
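
To pull out just these two values instead of reading the whole listing, something like the following could be used. This is only a sketch: get_setting is a made-up helper name, and it assumes the usual "key value" layout that `qconf -sp` and `qconf -sq` print.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the given key from qconf-style
# "key   value ..." output read on stdin (first match only).
get_setting() {
    awk -v key="$1" '$1 == key { $1 = ""; sub(/^ +/, ""); print; exit }'
}

# Usage on a submit host (not executed here, needs a working SGE setup):
#   qconf -sp orte  | get_setting control_slaves      # expected: TRUE
#   qconf -sq all.q | get_setting shell_start_mode    # expected: unix_behavior
# If a value differs, `qconf -mp orte` resp. `qconf -mq all.q` opens the
# configuration in an editor.
```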

-- Reuti


>   --------------------------------------------------------------------------
>  A daemon (pid 18327) died unexpectedly with status 1 while attempting
>  to launch so we are aborting.
> 
>  There may be more information reported by the environment (see above).
> 
>  This may be because the daemon was unable to find all the needed shared
>  libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>  location of the shared libraries on the remote nodes and this will
>  automatically be forwarded to the remote nodes.
>  --------------------------------------------------------------------------
>  error: executing task of job 84552 failed: execution daemon on host "node01" didn't accept task
>  --------------------------------------------------------------------------
>  mpirun noticed that the job aborted, but has no info as to the process
>  that caused that situation.
> 
> How can I fix this?
> 
> Many thanks
> 