As one of the error messages suggests, you need to add the Open MPI library directory to LD_LIBRARY_PATH on all of your nodes.
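Something along these lines might do it. This is only a sketch: MPI_HOME is a variable name I made up, and I am assuming the install prefix is the one in your mpirun path and that the libraries sit in the lib/ directory shown in your tab completion.

```shell
# Assumed install prefix, copied from the mpirun path in the quoted output.
MPI_HOME=/opt/SUNWhpc/HPC8.2.1c/sun

# Put the Open MPI binaries (mpirun, orted) first on PATH.
export PATH="$MPI_HOME/bin:$PATH"

# Prepend the library directory; the ${VAR:+...} form avoids a trailing
# colon when LD_LIBRARY_PATH was previously empty or unset.
export LD_LIBRARY_PATH="$MPI_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Since the home directory is shared over NFS, putting these lines in ~/.bashrc should make them visible on every node, assuming the remote shells read that file. Note also that the first line of your output is bash saying it cannot find orted at that path, so it is worth confirming that the same install exists at the same location on each compute node.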

On Wed, Mar 30, 2011 at 1:24 PM, Nehemiah Dacres <dacre...@slu.edu> wrote:
> I am trying to figure out why my jobs aren't getting distributed and need
> some help. I have an install of sun cluster tools on Rockscluster 5.2
> (essentially centos4u2). this user's account has its home dir shared via
> nfs. I am getting some strange errors. here's an example run
>
>
> [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 3 -hostfile
> list ./job2.sh
> bash: /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted: No such file or directory
> --------------------------------------------------------------------------
> A daemon (pid 20362) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
> [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/
> bin/ examples/ instrument/ man/
> etc/ include/ lib/ share/
> [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/orte
> orte-clean orted orte-iof orte-ps orterun
> [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted
> [therock.slu.loc:20365] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
> file runtime/orte_init.c at line 125
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_ess_base_select failed
> --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [therock.slu.loc:20365] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
> file orted/orted_main.c at line 325
> [jian@therock ~]$
>
>
> --
> Nehemiah I. Dacres
> System Administrator
> Advanced Technology Group Saint Louis University
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

--
David Zhang
University of California, San Diego