> No, there are no others you need to set. Ralph's referring to the fact
> that we set OMPI environment variables in the processes that are
> started on the remote nodes.
>
> I was asking to ensure you hadn't set any MCA parameters in the
> environment that could be creating a problem. Do you have any set in
> files, perchance?
>
> And can you run "env | grep OMPI" from the script that you invoked via
> mpirun?
>
> So just to be clear on the exact problem you're seeing:
>
> - you mpirun on a single node and all works fine
> - you mpirun on multiple nodes and all works fine (e.g., mpirun --host
>   a,b,c your_executable)
> - you mpirun on multiple nodes and list a host more than once and it
>   hangs (e.g., mpirun --host a,a,b,c your_executable)
>
> Is that correct?
>
> If so, can you attach a debugger to one of the hung processes and see
> exactly where it's hung? (i.e., get the stack traces)
>
> Per a question from your prior mail: yes, Open MPI does create mmapped
> files in /tmp for use with shared memory communication. They *should*
> get cleaned up when you exit, however, unless something disastrous
> happens.
Thank you very much! Now I understand what Ralph was asking. Yes, what you
described is exactly what happens with the sm btl. Having double-checked,
the problem only shows up when the sm btl is used for communication between
ranks on the same host (i.e., --mca btl openib,sm,self): everything runs
fine on a single node, and everything runs fine on multiple nodes as long
as each host is listed only once, but the job hangs in the MPI_Init() call
when I run on multiple nodes and list a host more than once. If I instead
use the tcp or openib btl without the sm layer (e.g., --mca btl
openib,self), all three cases run just fine.

I do set the MCA parameters "plm_rsh_agent" to "rsh:ssh" and
"btl_openib_warn_default_gid_prefix" to 0 in all cases, with or without the
sm btl. The OMPI environment variables set for each process are listed
below (output of "env | grep OMPI" from the script invoked by mpirun):

------
//process #0:
OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_local_daemon_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=0
OMPI_UNIVERSE_SIZE=4
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_num_procs=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=0
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=0

//process #1:
OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_local_daemon_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=1
OMPI_UNIVERSE_SIZE=4
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_num_procs=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=1
OMPI_COMM_WORLD_RANK=1
OMPI_COMM_WORLD_LOCAL_RANK=1

//process #3:
OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_daemonize=1
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=3
OMPI_MCA_orte_ess_num_procs=4
OMPI_MCA_orte_local_daemon_uri=195559424.1;tcp://198.177.146.71:53290;tcp://10.10.10.1:53290;tcp://172.23.10.2:53290;tcp://172.33.10.2:53290
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=3
OMPI_UNIVERSE_SIZE=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=3
OMPI_COMM_WORLD_LOCAL_RANK=1

//process #2:
OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_daemonize=1
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=2
OMPI_MCA_orte_ess_num_procs=4
OMPI_MCA_orte_local_daemon_uri=195559424.1;tcp://198.177.146.71:53290;tcp://10.10.10.1:53290;tcp://172.23.10.2:53290;tcp://172.33.10.2:53290
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=2
OMPI_UNIVERSE_SIZE=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=2
OMPI_COMM_WORLD_LOCAL_RANK=0
------

Processes #0 and #1 are on one host, while processes #2 and #3 are on the
other. With the sm btl enabled, my program just hangs in MPI_Init() at the
very beginning.
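For reference, the two invocations I am comparing look roughly like the
following (the hostnames nodeA/nodeB and the executable name are
placeholders, and I am passing the MCA parameters on the command line here
only to make the comparison explicit):

    # hangs in MPI_Init(): sm btl enabled, each host listed more than once
    mpirun --mca plm_rsh_agent rsh:ssh \
           --mca btl_openib_warn_default_gid_prefix 0 \
           --mca btl openib,sm,self \
           --host nodeA,nodeA,nodeB,nodeB -np 4 ./my_app

    # runs fine: same hosts, sm btl left out
    mpirun --mca plm_rsh_agent rsh:ssh \
           --mca btl_openib_warn_default_gid_prefix 0 \
           --mca btl openib,self \
           --host nodeA,nodeA,nodeB,nodeB -np 4 ./my_app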
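As for the stack traces you asked about: I have not captured them yet; the
way I would grab them from one of the hung ranks is roughly the following
(the PID is a placeholder), and I will follow up with the output:

    # attach to one of the ranks that is stuck in MPI_Init()
    gdb -p <pid-of-hung-rank>
    # then, at the gdb prompt, dump the backtraces of all threads
    (gdb) thread apply all bt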
I hope this makes the problem clear.

Thanks,
Yiguang