That's an excellent suggestion.

On Mon, Apr 4, 2011 at 9:45 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> As Ralph indicated, he'll add the hostname to the error message (but that
> might be tricky; that error message is coming from rsh/ssh...).
>
> In the meantime, you might try (csh style):
>
> foreach host (`cat list`)
>     echo $host
>     ls -l /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted
> end
>
> On Apr 4, 2011, at 10:24 AM, Nehemiah Dacres wrote:
>
> > I have installed it via a symlink on all of the nodes, I can go 'tentakel
> > which mpirun ' and it finds it' I'll check the library paths but isn't there
> > a way to find out which nodes are returning the error?
> >
> > On Thu, Mar 31, 2011 at 7:30 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> > The error message seems to imply that you don't have OMPI installed on
> > all your nodes (because it didn't find /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted
> > on a remote node).
> >
> > On Mar 30, 2011, at 4:24 PM, Nehemiah Dacres wrote:
> >
> > > I am trying to figure out why my jobs aren't getting distributed and
> > > need some help. I have an install of sun cluster tools on Rockscluster 5.2
> > > (essentially centos4u2). this user's account has its home dir shared via
> > > nfs. I am getting some strange errors. here's an example run
> > >
> > > [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 3 -hostfile list ./job2.sh
> > > bash: /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted: No such file or directory
> > > --------------------------------------------------------------------------
> > > A daemon (pid 20362) died unexpectedly with status 127 while attempting
> > > to launch so we are aborting.
> > >
> > > There may be more information reported by the environment (see above).
> > >
> > > This may be because the daemon was unable to find all the needed shared
> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > > location of the shared libraries on the remote nodes and this will
> > > automatically be forwarded to the remote nodes.
> > > --------------------------------------------------------------------------
> > > --------------------------------------------------------------------------
> > > mpirun noticed that the job aborted, but has no info as to the process
> > > that caused that situation.
> > > --------------------------------------------------------------------------
> > > mpirun: clean termination accomplished
> > >
> > > [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/
> > > bin/ examples/ instrument/ man/
> > > etc/ include/ lib/ share/
> > > [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/orte
> > > orte-clean orted orte-iof orte-ps orterun
> > > [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted
> > > [therock.slu.loc:20365] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125
> > > --------------------------------------------------------------------------
> > > It looks like orte_init failed for some reason; your parallel process is
> > > likely to abort. There are many reasons that a parallel process can
> > > fail during orte_init; some of which are due to configuration or
> > > environment problems. This failure appears to be an internal failure;
> > > here's some additional information (which may only be relevant to an
> > > Open MPI developer):
> > >
> > > orte_ess_base_select failed
> > > --> Returned value Not found (-13) instead of ORTE_SUCCESS
> > > --------------------------------------------------------------------------
> > > [therock.slu.loc:20365] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orted/orted_main.c at line 325
> > > [jian@therock ~]$
> > >
> > > --
> > > Nehemiah I. Dacres
> > > System Administrator
> > > Advanced Technology Group Saint Louis University
> > >
> > > _______________________________________________
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/

--
Nehemiah I. Dacres
System Administrator
Advanced Technology Group Saint Louis University
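
On the question raised in the thread (which nodes are returning the error), here is a minimal sketch in POSIX sh. It is not taken from any message above. Note that the csh loop as quoted runs ls on the local machine on every iteration, so checking the remote nodes means running the test through ssh. Assumptions: passwordless ssh to every host, and a hostfile (the thread's "list") whose first whitespace-separated field on each line is a hostname. The names check_hosts and REMOTE_SHELL are hypothetical, introduced only for illustration.

```shell
#!/bin/sh
# Report, per host, whether orted exists and is executable on that host.
# Assumptions (not from the thread): passwordless ssh works for every host,
# and the first field of each hostfile line is the hostname.

ORTED=${ORTED:-/opt/SUNWhpc/HPC8.2.1c/sun/bin/orted}
REMOTE_SHELL=${REMOTE_SHELL:-ssh}   # hypothetical override, e.g. "ssh -o BatchMode=yes"

check_hosts() {
    # Read the hostfile on fd 3 so the remote command never consumes the list.
    while read -r host _rest <&3; do
        case $host in
            ''|'#'*) continue ;;    # skip blank lines and comments
        esac
        status=0
        $REMOTE_SHELL "$host" test -x "$ORTED" </dev/null || status=$?
        case $status in
            0)   echo "$host: ok" ;;
            255) echo "$host: unreachable (ssh itself failed)" ;;
            *)   echo "$host: missing $ORTED" ;;
        esac
    done 3<"$1"
}
```

Calling `check_hosts list` prints one line per host. Because `test -x` follows symlinks, a symlink whose target does not exist on a given node is reported as missing there, which may be relevant given that the install was done via symlinks.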