Does this happen when you run without '-am ft-enable-cr' (so a no-C/R run)?

This will help us determine if your problem is with the C/R work or with the ORTE runtime. I suspect that there is something odd with your system that is confusing the runtime (so not a C/R problem).
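
One rough way to look for that kind of system oddity: the log quoted below fails in mkdir while creating a nested session directory under /tmp on nimbus1, so a small probe run as the same user on that node can show whether nested directories can be created there at all. This is only a sketch in Python. The function name can_create_nested_dir and the "session-probe-" prefix are made up for illustration, and it does not reproduce what ORTE itself does.

```python
import os
import shutil
import tempfile


def can_create_nested_dir(base="/tmp"):
    """Try to create a nested directory tree under `base`.

    Returns (ok, message). The layout <unique>/0/1 only loosely imitates
    the nesting seen in the error log; it is an assumption, not ORTE's logic.
    """
    top = None
    try:
        # Unique top-level directory, so nothing existing is touched.
        top = tempfile.mkdtemp(prefix="session-probe-", dir=base)
        os.makedirs(os.path.join(top, "0", "1"))
        return True, "ok"
    except OSError as exc:
        return False, f"{exc.filename}: {exc.strerror}"
    finally:
        # Remove only what this probe created.
        if top is not None:
            shutil.rmtree(top, ignore_errors=True)


if __name__ == "__main__":
    ok, message = can_create_nested_dir("/tmp")
    print("nested mkdir under /tmp:", "succeeded" if ok else "failed", "-", message)
```

If the probe fails on nimbus1 but succeeds on nimbus, ownership or permissions of /tmp (or of a leftover session directory from an earlier run) would be the first thing to look at.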

Have you made sure to remove the previous versions of Open MPI from all machines on your cluster, before installing the new version? Sometimes problems like this come up because of mismatches in Open MPI versions on a machine.
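
A quick way to spot such a mismatch is to ask every node which mpirun it picks up and compare the answers. The sketch below assumes passwordless ssh to each host and that mpirun is on the PATH of a non-interactive ssh session. The helper names (remote_version, group_by_version, has_mismatch) are invented for this example. It only reports the mpirun that ssh finds first, so it will not detect stale installs sitting elsewhere on disk.

```python
import subprocess
from collections import defaultdict


def remote_version(host, timeout=30):
    """Run `mpirun --version` on `host` over ssh; return the first non-empty output line."""
    result = subprocess.run(
        ["ssh", host, "mpirun", "--version"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # Look at both streams, since the version banner may go to either one.
    lines = [ln.strip() for ln in (result.stdout + result.stderr).splitlines() if ln.strip()]
    return lines[0] if lines else "<no output>"


def group_by_version(versions):
    """Map each reported version string to the list of hosts that reported it."""
    groups = defaultdict(list)
    for host, version in versions.items():
        groups[version].append(host)
    return dict(groups)


def has_mismatch(versions):
    """True if the hosts do not all report the same version string."""
    return len(set(versions.values())) > 1


# Example use (host names taken from the quoted message):
#   versions = {h: remote_version(h) for h in ["nimbus", "nimbus1"]}
#   if has_mismatch(versions):
#       print(group_by_version(versions))
```

Running ompi_info on each node gives more detail if the version lines match but the problem persists.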

-- Josh

On Mar 23, 2010, at 5:42 PM, fengguang tian wrote:

I ran into the same problem described at this link:
http://www.open-mpi.org/community/lists/users/2009/12/11374.php

In that thread, the suggested solution is to use Open MPI v1.4 instead of v1.3. However, I am using Open MPI v1.7a1r22794 and still hit the same problem.
Here is what I have done:
My cluster is composed of two machines: nimbus (master) and nimbus1 (slave). When I run
mpirun -np 40 -am ft-enable-cr --hostfile .mpihostfile myapplication
on nimbus, it does not work and shows:

[nimbus1:21387] opal_os_dirpath_create: Error: Unable to create the sub-directory (/tmp/openmpi-sessions-mpiu@nimbus1_0/59759) of (/tmp/openmpi-sessions-mpiu@nimbus1_0/59759/0/1), mkdir failed [1]
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 106
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 399
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 301
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
[nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file ess_env_module.c at line 143
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
[nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 129
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
[nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file orted/orted_main.c at line 355
--------------------------------------------------------------------------
A daemon (pid 10737) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------


cheers
fengguang
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
