> No, there are no others you need to set. Ralph's referring to the fact
> that we set OMPI environment variables in the processes that are
> started on the remote nodes.
> 
> I was asking to ensure you hadn't set any MCA parameters in the
> environment that could be creating a problem. Do you have any set in
> files, perchance?
> 
> And can you run "env | grep OMPI" from the script that you invoked via
> mpirun?
> 
> So just to be clear on the exact problem you're seeing:
> 
> - you mpirun on a single node and all works fine
> - you mpirun on multiple nodes and all works fine (e.g., mpirun --host
> a,b,c your_executable)
> - you mpirun on multiple nodes and list a host more than once and it
> hangs (e.g., mpirun --host a,a,b,c your_executable)
> 
> Is that correct?
> 
> If so, can you attach a debugger to one of the hung processes and see
> exactly where it's hung? (i.e., get the stack traces)
> 
> Per a question from your prior mail: yes, Open MPI does create mmapped
> files in /tmp for use with shared memory communication. They *should*
> get cleaned up when you exit, however, unless something disastrous
> happens. 

Thank you very much!

Now I am clearer about what Ralph asked.

Yes, what you described is exactly right, and it involves the sm BTL.
As I double-checked, the problem appears when I use the sm BTL for MPI
communication on the same host (i.e., --mca btl openib,sm,self): all
runs fine on a single node, all runs fine on multiple distinct nodes,
but it hangs in the MPI_Init() call when I run on multiple nodes and
list a host more than once. However, if I instead use the tcp or openib
BTL without the sm layer (e.g., --mca btl openib,self), all three cases
run just fine.
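To spell the three cases out explicitly (the host names a, b and the
program name ./a.out are placeholders, not my actual setup):

```shell
mpirun --mca btl openib,sm,self --host a ./a.out       # single node: runs fine
mpirun --mca btl openib,sm,self --host a,b ./a.out     # two distinct nodes: runs fine
mpirun --mca btl openib,sm,self --host a,a,b ./a.out   # a host listed twice: hangs in MPI_Init()
mpirun --mca btl openib,self --host a,a,b ./a.out      # same layout, no sm BTL: runs fine
```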

I do set the MCA parameters "plm_rsh_agent" to "rsh:ssh" and
"btl_openib_warn_default_gid_prefix" to 0 in all cases, with or
without the sm BTL. The OMPI environment variables set for each
process are quoted below (as output by env | grep OMPI in the
script invoked by mpirun):

------
//process #0:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_local_daemon_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=0
OMPI_UNIVERSE_SIZE=4
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_num_procs=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=0
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=0

//process #1:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_local_daemon_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=1
OMPI_UNIVERSE_SIZE=4
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_num_procs=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=1
OMPI_COMM_WORLD_RANK=1
OMPI_COMM_WORLD_LOCAL_RANK=1

//process #3:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_daemonize=1
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=3
OMPI_MCA_orte_ess_num_procs=4
OMPI_MCA_orte_local_daemon_uri=195559424.1;tcp://198.177.146.71:53290;tcp://10.10.10.1:53290;tcp://172.23.10.2:53290;tcp://172.33.10.2:53290
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=3
OMPI_UNIVERSE_SIZE=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=3
OMPI_COMM_WORLD_LOCAL_RANK=1

//process #2:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_daemonize=1
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=2
OMPI_MCA_orte_ess_num_procs=4
OMPI_MCA_orte_local_daemon_uri=195559424.1;tcp://198.177.146.71:53290;tcp://10.10.10.1:53290;tcp://172.23.10.2:53290;tcp://172.33.10.2:53290
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=2
OMPI_UNIVERSE_SIZE=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=2
OMPI_COMM_WORLD_LOCAL_RANK=0

------
Processes #0 and #1 are on one host, while processes #2 and #3
are on the other.
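The OMPI_COMM_WORLD_LOCAL_RANK values above are consistent with a
simple by-slot mapping of 4 ranks over 2 hosts with 2 slots each; a
small sketch of that arithmetic (host 0/1 are stand-ins for my two
machines):

```shell
# Reproduce the rank-to-host/local-rank layout shown in the env dump:
# ranks 0,1 -> host 0 (local ranks 0,1); ranks 2,3 -> host 1 (local ranks 0,1).
for rank in 0 1 2 3; do
  host=$(( rank / 2 ))        # 2 slots per host (assumption matching the dump)
  local_rank=$(( rank % 2 ))  # position within the host
  echo "rank=$rank host=$host local_rank=$local_rank"
done
```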

When I use the sm BTL, my program hangs in MPI_Init() at the
very beginning.
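For the stack traces requested above, I understand one way to get them
non-interactively is something like the following (12345 is a
placeholder for the actual PID of a hung rank):

```shell
# Attach gdb to one hung rank and dump every thread's backtrace,
# then detach automatically (-batch). Find the PID first, e.g. with
# `pgrep your_executable` on the node where the rank is running.
gdb -batch -p 12345 -ex "thread apply all bt"
```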

I hope I have made myself clear.

Thanks,
Yiguang
