It looks to me like your remote nodes aren't finding the orted executable. I suspect the problem is that you need to forward PATH and LD_LIBRARY_PATH to the remote nodes. Use the mpirun -x option to do so.
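For example (a sketch only; `-x` tells mpirun to export the named environment variables to the remote nodes before launching, and the explicit library path below is just the install prefix from your message, adjust as needed):

```shell
# Forward your current PATH and LD_LIBRARY_PATH so the remote
# nodes can locate orted and its shared libraries:
mpirun -x PATH -x LD_LIBRARY_PATH hostname

# Or set an explicit value on the command line instead of
# inheriting it from the submission environment:
mpirun -x LD_LIBRARY_PATH=/usr/local/packages/openmpi-1.4.2/lib hostname
```

In your job script, the `-x` flags would go on the existing `mpirun hostname` line after the `module add` step.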
On Oct 4, 2010, at 5:08 AM, Chris Jewell wrote:

> Hi all,
>
> Firstly, hello to the mailing list for the first time! Secondly, sorry for
> the non-descript subject line, but I couldn't really think how to be more
> specific!
>
> Anyway, I am currently having a problem getting OpenMPI to work within my
> installation of SGE 6.2u5. I compiled OpenMPI 1.4.2 from source, and
> installed under /usr/local/packages/openmpi-1.4.2. Software on my system is
> controlled by the Modules framework which adds the bin and lib directories to
> PATH and LD_LIBRARY_PATH respectively when a user is connected to an
> execution node. I configured a parallel environment in which OpenMPI is to
> be used:
>
> pe_name            mpi
> slots              16
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
>
> I then tried a simple job submission script:
>
> #!/bin/bash
> #
> #$ -S /bin/bash
> . /etc/profile
> module add ompi gcc
> mpirun hostname
>
> If the parallel environment runs within one execution host (8 slots per
> host), then all is fine. However, if scheduled across several nodes, I get
> an error:
>
> execv: No such file or directory
> execv: No such file or directory
> execv: No such file or directory
> --------------------------------------------------------------------------
> A daemon (pid 1629) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
>
> I'm at a loss on how to start debugging this, and I don't seem to be getting
> anything useful using the mpirun '-d' and '-v' switches. SGE logs don't note
> anything. Can anyone suggest either what is wrong, or how I might progress
> with getting more information?
>
> Many thanks,
>
>
> Chris
>
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users