It looks to me like your remote nodes aren't finding the orted executable. I 
suspect the problem is that you need to forward PATH and LD_LIBRARY_PATH 
to the remote nodes. Use the mpirun -x option to do so.
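
For example, your job script might become something like the sketch below. 
I haven't tested it on your setup, and I'm assuming the module names and 
everything else stay exactly as you posted them; the only change is the 
pair of -x flags on the mpirun line.

```shell
#!/bin/bash
#
#$ -S /bin/bash
. /etc/profile
# Assumption: "ompi" and "gcc" are the module names from your original script.
module add ompi gcc
# -x VAR exports the current value of VAR from this shell's environment
# to the remote nodes, so they see the same PATH and LD_LIBRARY_PATH
# that the modules set up here.
mpirun -x PATH -x LD_LIBRARY_PATH hostname
```

If the multi-node run still fails with the same execv error, then the 
forwarded environment probably isn't the whole story.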


On Oct 4, 2010, at 5:08 AM, Chris Jewell wrote:

> Hi all,
> 
> Firstly, hello to the mailing list for the first time!  Secondly, sorry for 
> the non-descript subject line, but I couldn't really think how to be more 
> specific!  
> 
> Anyway, I am currently having a problem getting OpenMPI to work within my 
> installation of SGE 6.2u5.  I compiled OpenMPI 1.4.2 from source, and 
> installed under /usr/local/packages/openmpi-1.4.2.  Software on my system is 
> controlled by the Modules framework which adds the bin and lib directories to 
> PATH and LD_LIBRARY_PATH respectively when a user is connected to an 
> execution node.  I configured a parallel environment in which OpenMPI is to 
> be used: 
> 
> pe_name            mpi
> slots              16
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
> 
> I then tried a simple job submission script:
> 
> #!/bin/bash
> #
> #$ -S /bin/bash
> . /etc/profile
> module add ompi gcc
> mpirun hostname
> 
> If the parallel environment runs within one execution host (8 slots per 
> host), then all is fine.  However, if scheduled across  several nodes, I get 
> an error:
> 
> execv: No such file or directory
> execv: No such file or directory
> execv: No such file or directory
> --------------------------------------------------------------------------
> A daemon (pid 1629) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
> 
> 
> I'm at a loss on how to start debugging this, and I don't seem to be getting 
> anything useful using the mpirun '-d' and '-v' switches.  SGE logs don't note 
> anything.  Can anyone suggest either what is wrong, or how I might progress 
> with getting more information?
> 
> Many thanks,
> 
> 
> Chris
> 
> 
> 
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
> 
> 
> 
> 
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

