On Apr 4, 2011, at 8:42 AM, Nehemiah Dacres wrote:

> You do realize that this is the Sun Cluster Tools branch? (It is a branch, 
> right? Or is it a *port* of Open MPI to Sun's compilers?) I'm not sure 
> whether your changes made it into Sun Cluster Tools 8.2.1.

My point was that the error message currently doesn't include the node name - 
not in the OMPI main code base, nor in the SCT port. So I will add it, which 
won't help you at the moment.

Hence my suggestion about using the param :-)
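
If it turns out not to be a debug build, a plain ssh loop over the hostfile can give a similar answer. This is only a rough sketch, not something that ships with OMPI or the SCT port. It assumes passwordless ssh to every node and that each hostfile line starts with the host name. The check_orted function and the ORTED and REMOTE variables are names made up for illustration; the orted path is the one from the original error message.

```shell
#!/bin/sh
# Sketch: report which hosts in a hostfile lack an executable orted.
# ORTED and REMOTE are illustrative names, not Open MPI settings.
ORTED=${ORTED:-/opt/SUNWhpc/HPC8.2.1c/sun/bin/orted}
REMOTE=${REMOTE:-ssh}   # override to dry-run the loop without ssh

check_orted() {
    hostfile=$1
    # Drop comments and blank lines, keep the first field (the host name).
    sed -e 's/#.*//' "$hostfile" | awk 'NF { print $1 }' | sort -u |
    while read -r host; do
        rc=0
        # stdin comes from /dev/null so ssh cannot eat the rest of the host list.
        $REMOTE "$host" "test -x '$ORTED'" </dev/null || rc=$?
        case $rc in
            0)   echo "$host: ok" ;;
            255) echo "$host: could not connect" ;;
            *)   echo "$host: MISSING $ORTED" ;;
        esac
    done
}

# Usage, with the hostfile from the original report:
#   check_orted list
```

Because it uses test -x, a dangling symlink on a node shows up as MISSING too, and ssh's own exit status of 255 is reported separately as a connection problem. On a debug build, adding -mca plm_base_verbose 1 to the original mpirun line should give the same information straight from the launcher.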


> 
> On Mon, Apr 4, 2011 at 9:34 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Guess I can/will add the node name to the error message - should have been 
> there before now.
> 
> If it is a debug build, you can add "-mca plm_base_verbose 1" to the cmd line 
> and get output tracing the launch and showing you what nodes are having 
> problems.
> 
> 
> On Apr 4, 2011, at 8:24 AM, Nehemiah Dacres wrote:
> 
>> I have installed it via a symlink on all of the nodes. I can run 'tentakel 
>> which mpirun' and it finds it. I'll check the library paths, but isn't there 
>> a way to find out which nodes are returning the error? 
>> 
>> 
>> On Thu, Mar 31, 2011 at 7:30 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
>> The error message seems to imply that you don't have OMPI installed on all 
>> your nodes (because it didn't find /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted on a 
>> remote node).
>> 
>> 
>> On Mar 30, 2011, at 4:24 PM, Nehemiah Dacres wrote:
>> 
>> > I am trying to figure out why my jobs aren't getting distributed and need 
>> > some help. I have an install of Sun Cluster Tools on Rocks Cluster 5.2 
>> > (essentially centos4u2). This user's account has its home dir shared via 
>> > NFS. I am getting some strange errors. Here's an example run:
>> >
>> >
>> > [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 3 -hostfile list ./job2.sh
>> > bash: /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted: No such file or directory
>> > --------------------------------------------------------------------------
>> > A daemon (pid 20362) died unexpectedly with status 127 while attempting
>> > to launch so we are aborting.
>> >
>> > There may be more information reported by the environment (see above).
>> >
>> > This may be because the daemon was unable to find all the needed shared
>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> > location of the shared libraries on the remote nodes and this will
>> > automatically be forwarded to the remote nodes.
>> > --------------------------------------------------------------------------
>> > --------------------------------------------------------------------------
>> > mpirun noticed that the job aborted, but has no info as to the process
>> > that caused that situation.
>> > --------------------------------------------------------------------------
>> > mpirun: clean termination accomplished
>> >
>> > [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/
>> > bin/        examples/   instrument/ man/
>> > etc/        include/    lib/        share/
>> > [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/orte
>> > orte-clean  orted       orte-iof    orte-ps     orterun
>> > [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted
>> > [therock.slu.loc:20365] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125
>> > --------------------------------------------------------------------------
>> > It looks like orte_init failed for some reason; your parallel process is
>> > likely to abort.  There are many reasons that a parallel process can
>> > fail during orte_init; some of which are due to configuration or
>> > environment problems.  This failure appears to be an internal failure;
>> > here's some additional information (which may only be relevant to an
>> > Open MPI developer):
>> >
>> >   orte_ess_base_select failed
>> >   --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> > --------------------------------------------------------------------------
>> > [therock.slu.loc:20365] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orted/orted_main.c at line 325
>> > [jian@therock ~]$
>> >
>> >
>> > --
>> > Nehemiah I. Dacres
>> > System Administrator
>> > Advanced Technology Group Saint Louis University
>> >
>> > _______________________________________________
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> 
>> -- 
>> Nehemiah I. Dacres
>> System Administrator 
>> Advanced Technology Group Saint Louis University
>> 
> 
> 
> 
> -- 
> Nehemiah I. Dacres
> System Administrator 
> Advanced Technology Group Saint Louis University
> 
