Hmmm... just testing on my little cluster here on two nodes; it works just fine with 1.8.2:
[rhc@bend001 v1.8]$ mpirun -n 2 --map-by node ./a.out
 In rank 0 and host= bend001 Do Barrier call 1.
 In rank 0 and host= bend001 Do Barrier call 2.
 In rank 0 and host= bend001 Do Barrier call 3.
 In rank 1 and host= bend002 Do Barrier call 1.
 In rank 1 and host= bend002 Do Barrier call 2.
 In rank 1 and host= bend002 Do Barrier call 3.
[rhc@bend001 v1.8]$

How are you configuring OMPI?

On May 2, 2014, at 2:24 PM, Clay Kirkland <clay.kirkl...@versityinc.com> wrote:

> I have been using MPI for many, many years, so I have very well debugged MPI tests.
> I am having trouble with both the openmpi-1.4.5 and openmpi-1.6.5 versions, though,
> in getting the MPI_Barrier calls to work. It works fine when I run all processes on
> one machine, but when I run with two or more hosts the second call to MPI_Barrier
> always hangs. Not the first one, but always the second one. I looked at the FAQs and
> such but found nothing, except for a comment that MPI_Barrier problems are often
> firewall problems. Not having the same version of MPI on both machines was also
> mentioned as a possible cause. I turned the firewalls off, and removed and
> reinstalled the same version on both hosts, but I still see the same thing. I then
> installed LAM/MPI on two of my machines and that works fine. I can call the
> MPI_Barrier function many times with no hangs when running on either of the two
> machines by itself; it only hangs if two or more hosts are involved. These runs are
> all being done on CentOS release 6.4. Here is the test program I used:
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> main (argc, argv)
> int argc;
> char **argv;
> {
>     char message[20];
>     char hoster[256];
>     char nameis[256];
>     int fd, i, j, jnp, iret, myrank, np, ranker, recker;
>     MPI_Comm comm;
>     MPI_Status status;
>
>     MPI_Init( &argc, &argv );
>     MPI_Comm_rank( MPI_COMM_WORLD, &myrank);
>     MPI_Comm_size( MPI_COMM_WORLD, &np);
>
>     gethostname(hoster,256);
>
>     printf(" In rank %d and host= %s Do Barrier call 1.\n",myrank,hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     printf(" In rank %d and host= %s Do Barrier call 2.\n",myrank,hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     printf(" In rank %d and host= %s Do Barrier call 3.\n",myrank,hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     exit(0);
> }
>
> Here are three runs of the test program: first with two processes on one host, then
> with two processes on the other host, and finally with one process on each of the
> two hosts. The first two runs are fine, but the last run hangs on the second
> MPI_Barrier.
>
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos a.out
> In rank 0 and host= centos Do Barrier call 1.
> In rank 1 and host= centos Do Barrier call 1.
> In rank 1 and host= centos Do Barrier call 2.
> In rank 1 and host= centos Do Barrier call 3.
> In rank 0 and host= centos Do Barrier call 2.
> In rank 0 and host= centos Do Barrier call 3.
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out
> /root/.bashrc: line 14: unalias: ls: not found
> In rank 0 and host= RAID Do Barrier call 1.
> In rank 0 and host= RAID Do Barrier call 2.
> In rank 0 and host= RAID Do Barrier call 3.
> In rank 1 and host= RAID Do Barrier call 1.
> In rank 1 and host= RAID Do Barrier call 2.
> In rank 1 and host= RAID Do Barrier call 3.
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos,RAID a.out
> /root/.bashrc: line 14: unalias: ls: not found
> In rank 0 and host= centos Do Barrier call 1.
> In rank 0 and host= centos Do Barrier call 2.
> In rank 1 and host= RAID Do Barrier call 1.
> In rank 1 and host= RAID Do Barrier call 2.
>
> Since it is such a simple test and MPI_Barrier is such a widely used MPI function,
> it must obviously be an installation or configuration problem. A pstack for each of
> the hung MPI_Barrier processes on the two machines shows this:
>
> [root@centos ~]# pstack 31666
> #0  0x0000003baf0e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00007f5de06125eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
> #2  0x00007f5de061475a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
> #3  0x00007f5de0639229 in opal_progress () from /usr/local/lib/libmpi.so.1
> #4  0x00007f5de0586f75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
> #5  0x00007f5ddc59565e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #6  0x00007f5ddc59d8ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #7  0x00007f5de05941c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
> #8  0x0000000000400a43 in main ()
>
> [root@RAID openmpi-1.6.5]# pstack 22167
> #0  0x00000030302e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00007f7ee46885eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
> #2  0x00007f7ee468a75a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
> #3  0x00007f7ee46af229 in opal_progress () from /usr/local/lib/libmpi.so.1
> #4  0x00007f7ee45fcf75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
> #5  0x00007f7ee060b65e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #6  0x00007f7ee06138ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #7  0x00007f7ee460a1c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
> #8  0x0000000000400a43 in main ()
>
> The stacks look exactly the same on each machine. Any thoughts or ideas would be
> greatly appreciated, as I am stuck.
>
> Clay Kirkland
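
One quick way to rule out the version-mismatch theory raised in the quoted message is to have every rank report the MPI library it was built against. Below is a minimal sketch; it assumes Open MPI, since the OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, and OMPI_RELEASE_VERSION macros come from Open MPI's mpi.h and are not part of the MPI standard (MPI_Get_version only reports the supported standard level).

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char host[256];
    int rank, version, subversion;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* MPI standard level supported by the library, e.g. 2.1 for the 1.6 series. */
    MPI_Get_version(&version, &subversion);
    gethostname(host, sizeof(host));

    /* The OMPI_*_VERSION macros are Open MPI-specific and identify the build this
       executable was compiled against; the numbers should match on every host. */
    printf("rank %d on %s: MPI standard %d.%d, Open MPI %d.%d.%d\n",
           rank, host, version, subversion,
           OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION);

    MPI_Finalize();
    return 0;
}

Launched with the same command line that hangs (mpirun -np 2 --host centos,RAID a.out), matching version numbers from both ranks would cross the version mismatch off the list; differing numbers would mean the two hosts are picking up different installations.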