The error message suggests that you don't have Open MPI installed on all of your 
nodes: the launch failed because /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted could not 
be found on a remote node.
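
A quick way to confirm is to check for orted on every host in your hostfile. 
This is just a sketch; it assumes the hostfile is the file "list" from your 
mpirun command, that the hostname is the first field on each line, and that 
passwordless ssh to the compute nodes works:

  # check each host listed in the hostfile for orted
  for h in $(awk '{print $1}' list); do
      ssh "$h" ls -l /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted \
          || echo "$h: orted not found"
  done

If orted turns out to be missing on some of those hosts, installing ClusterTools 
there (or otherwise making /opt/SUNWhpc visible on those nodes) should get you 
past this error.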


On Mar 30, 2011, at 4:24 PM, Nehemiah Dacres wrote:

> I am trying to figure out why my jobs aren't getting distributed and need 
> some help. I have an install of Sun HPC ClusterTools on Rocks Cluster 5.2 
> (essentially CentOS 4u2). This user's account has its home directory shared 
> via NFS. I am getting some strange errors; here's an example run: 
> 
> 
> [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 3 -hostfile list 
> ./job2.sh 
> bash: /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted: No such file or directory
> --------------------------------------------------------------------------
> A daemon (pid 20362) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
> 
> [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/
> bin/        examples/   instrument/ man/        
> etc/        include/    lib/        share/      
> [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/orte
> orte-clean  orted       orte-iof    orte-ps     orterun     
> [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted 
> [therock.slu.loc:20365] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
> runtime/orte_init.c at line 125
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_ess_base_select failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [therock.slu.loc:20365] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
> orted/orted_main.c at line 325
> [jian@therock ~]$ 
> 
> 
> -- 
> Nehemiah I. Dacres
> System Administrator 
> Advanced Technology Group Saint Louis University
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

