You do realize that this is the Sun Cluster Tools branch? (It is a branch, right? Or is it a *port* of Open MPI to Sun's compilers?) I'm not sure whether your changes made it into SunCT 8.2.1.

On Mon, Apr 4, 2011 at 9:34 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Guess I can/will add the node name to the error message - should have been
> there before now.
>
> If it is a debug build, you can add "-mca plm_base_verbose 1" to the cmd
> line and get output tracing the launch and showing you what nodes are having
> problems.
>
>
> On Apr 4, 2011, at 8:24 AM, Nehemiah Dacres wrote:
>
> I have installed it via a symlink on all of the nodes; I can run 'tentakel
> which mpirun' and it finds it. I'll check the library paths, but isn't there
> a way to find out which nodes are returning the error?
>
>
> On Thu, Mar 31, 2011 at 7:30 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
>
>> The error message seems to imply that you don't have OMPI installed on all
>> your nodes (because it didn't find /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted on a
>> remote node).
>>
>>
>> On Mar 30, 2011, at 4:24 PM, Nehemiah Dacres wrote:
>>
>> > I am trying to figure out why my jobs aren't getting distributed and
>> > need some help. I have an install of Sun Cluster Tools on Rocks Cluster 5.2
>> > (essentially CentOS 4u2). This user's account has its home dir shared via
>> > NFS. I am getting some strange errors. Here's an example run:
>> >
>> >
>> > [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 3 -hostfile list ./job2.sh
>> > bash: /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted: No such file or directory
>> > --------------------------------------------------------------------------
>> > A daemon (pid 20362) died unexpectedly with status 127 while attempting
>> > to launch so we are aborting.
>> >
>> > There may be more information reported by the environment (see above).
>> >
>> > This may be because the daemon was unable to find all the needed shared
>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> > location of the shared libraries on the remote nodes and this will
>> > automatically be forwarded to the remote nodes.
>> > --------------------------------------------------------------------------
>> > --------------------------------------------------------------------------
>> > mpirun noticed that the job aborted, but has no info as to the process
>> > that caused that situation.
>> > --------------------------------------------------------------------------
>> > mpirun: clean termination accomplished
>> >
>> > [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/
>> > bin/ examples/ instrument/ man/
>> > etc/ include/ lib/ share/
>> > [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/orte
>> > orte-clean orted orte-iof orte-ps orterun
>> > [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted
>> > [therock.slu.loc:20365] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125
>> > --------------------------------------------------------------------------
>> > It looks like orte_init failed for some reason; your parallel process is
>> > likely to abort. There are many reasons that a parallel process can
>> > fail during orte_init; some of which are due to configuration or
>> > environment problems. This failure appears to be an internal failure;
>> > here's some additional information (which may only be relevant to an
>> > Open MPI developer):
>> >
>> > orte_ess_base_select failed
>> > --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> > --------------------------------------------------------------------------
>> > [therock.slu.loc:20365] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orted/orted_main.c at line 325
>> > [jian@therock ~]$
>> >
>> >
>> > --
>> > Nehemiah I. Dacres
>> > System Administrator
>> > Advanced Technology Group Saint Louis University
>> >
>> > _______________________________________________
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/

--
Nehemiah I. Dacres
System Administrator
Advanced Technology Group Saint Louis University
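
One rough way to answer the "which nodes?" question without a debug build is to probe each host in the hostfile for the daemon directly. The following is a minimal Python sketch. It assumes passwordless, non-interactive ssh to every node (the ssh launcher needs this anyway) and a hostfile with one hostname per line, optionally followed by attributes such as slots=N, with # starting a comment. It checks the exact orted path from the error message rather than whatever 'which mpirun' resolves on PATH. Because test -x follows symlinks, a symlink whose target is absent on a node is also reported as missing. The helper names (parse_hostfile, ssh_has_orted, missing_orted) are hypothetical and are not part of Open MPI or Sun Cluster Tools.

```python
import subprocess
import sys
from typing import Callable, Iterable, List

# Assumption: the orted path reported in the error message in this thread.
ORTED = "/opt/SUNWhpc/HPC8.2.1c/sun/bin/orted"


def parse_hostfile(lines: Iterable[str]) -> List[str]:
    """Return unique hostnames, in order, from a simple hostfile.

    Assumes one host per line, optional attributes (e.g. slots=2)
    after the name, and '#' starting a comment.
    """
    hosts: List[str] = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name = line.split()[0]  # first token is the hostname
        if name not in hosts:
            hosts.append(name)
    return hosts


def ssh_has_orted(host: str, path: str = ORTED) -> bool:
    """True if 'test -x path' succeeds on host via non-interactive ssh.

    An unreachable host (ssh exits with 255) also counts as False.
    """
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, "test", "-x", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def missing_orted(
    hosts: Iterable[str],
    check: Callable[[str, str], bool] = ssh_has_orted,
    path: str = ORTED,
) -> List[str]:
    """Return the hosts on which check(host, path) reports no usable orted."""
    return [h for h in hosts if not check(h, path)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 check_orted.py list   (the same hostfile given to mpirun)
    with open(sys.argv[1]) as f:
        bad = missing_orted(parse_hostfile(f))
    for host in bad:
        print(f"{host}: {ORTED} missing, not executable, or host unreachable")
    if not bad:
        print("orted found on every host")
```

Running it against the same hostfile as the failing job (for example, python3 check_orted.py list, where the script name is arbitrary) prints one line per problem node. The check callable is injectable so the parsing and filtering logic can be exercised without ssh. The script only confirms that the daemon binary is present and executable. It does not confirm that orted's shared libraries resolve on that node. On a debug build, Ralph's "-mca plm_base_verbose 1" suggestion traces the actual launch.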