I am configuring with all defaults: just ./configure, then make and make install. I have built Open MPI this way on several kinds of Unix systems and have had no trouble before; I believe my last success was on a Red Hat version of Linux.
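In other words, on each host (a minimal sketch; the 1.6.5 tarball name and the /usr/local prefix are taken from the details quoted below):

    # unpack and build with all defaults; installs under /usr/local
    tar xzf openmpi-1.6.5.tar.gz
    cd openmpi-1.6.5
    ./configure
    make
    make install          # as root, for the /usr/local prefix

    # repeated identically on every host so the versions match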
On Sat, May 3, 2014 at 11:00 AM, <users-requ...@open-mpi.org> wrote:

> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 2 May 2014 16:24:04 -0500
> From: Clay Kirkland <clay.kirkl...@versityinc.com>
> To: us...@open-mpi.org
> Subject: [OMPI users] MPI_Barrier hangs on second attempt but only
>          when multiple hosts used.
>
> I have been using MPI for many years, so I have well-debugged MPI
> tests. With both openmpi-1.4.5 and openmpi-1.6.5, however, I am having
> trouble getting the MPI_Barrier calls to work. Everything is fine when
> I run all processes on one machine, but when I run with two or more
> hosts, the second call to MPI_Barrier always hangs -- not the first
> one, but always the second one. I looked at the FAQs and such but
> found nothing except a comment that MPI_Barrier problems are often
> firewall problems; not having the same version of MPI on both machines
> was also mentioned as a cause. I turned the firewalls off and removed
> and reinstalled the same version on both hosts, but I still see the
> same thing. I then installed LAM/MPI on two of my machines and that
> works fine. I can call MPI_Barrier many times with no hangs on either
> machine by itself; it hangs only when two or more hosts are involved.
> These runs are all being done on CentOS release 6.4. Here is the test
> program I used:
>
> #include <stdio.h>
> #include <unistd.h>
> #include <mpi.h>
>
> int main (int argc, char **argv)
> {
>     char hoster[256];
>     int myrank, np;
>
>     MPI_Init( &argc, &argv );
>     MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
>     MPI_Comm_size( MPI_COMM_WORLD, &np );
>
>     gethostname(hoster, 256);
>
>     printf(" In rank %d and host= %s Do Barrier call 1.\n", myrank, hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     printf(" In rank %d and host= %s Do Barrier call 2.\n", myrank, hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     printf(" In rank %d and host= %s Do Barrier call 3.\n", myrank, hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     return 0;
> }
>
> Here are three runs of the test program: first with two processes on
> one host, then with two processes on another host, and finally with
> one process on each of two hosts. The first two runs are fine, but the
> last run hangs on the second MPI_Barrier.
>
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos a.out
>  In rank 0 and host= centos Do Barrier call 1.
>  In rank 1 and host= centos Do Barrier call 1.
>  In rank 1 and host= centos Do Barrier call 2.
>  In rank 1 and host= centos Do Barrier call 3.
>  In rank 0 and host= centos Do Barrier call 2.
>  In rank 0 and host= centos Do Barrier call 3.
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out
> /root/.bashrc: line 14: unalias: ls: not found
>  In rank 0 and host= RAID Do Barrier call 1.
>  In rank 0 and host= RAID Do Barrier call 2.
>  In rank 0 and host= RAID Do Barrier call 3.
>  In rank 1 and host= RAID Do Barrier call 1.
>  In rank 1 and host= RAID Do Barrier call 2.
>  In rank 1 and host= RAID Do Barrier call 3.
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos,RAID a.out
> /root/.bashrc: line 14: unalias: ls: not found
>  In rank 0 and host= centos Do Barrier call 1.
>  In rank 0 and host= centos Do Barrier call 2.
>  In rank 1 and host= RAID Do Barrier call 1.
>  In rank 1 and host= RAID Do Barrier call 2.
>
> Since this is such a simple test of such a widely used MPI function,
> it must be an installation or configuration problem. A pstack of each
> hung MPI_Barrier process on the two machines shows this:
>
> [root@centos ~]# pstack 31666
> #0  0x0000003baf0e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00007f5de06125eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
> #2  0x00007f5de061475a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
> #3  0x00007f5de0639229 in opal_progress () from /usr/local/lib/libmpi.so.1
> #4  0x00007f5de0586f75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
> #5  0x00007f5ddc59565e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #6  0x00007f5ddc59d8ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #7  0x00007f5de05941c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
> #8  0x0000000000400a43 in main ()
>
> [root@RAID openmpi-1.6.5]# pstack 22167
> #0  0x00000030302e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00007f7ee46885eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
> #2  0x00007f7ee468a75a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
> #3  0x00007f7ee46af229 in opal_progress () from /usr/local/lib/libmpi.so.1
> #4  0x00007f7ee45fcf75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
> #5  0x00007f7ee060b65e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #6  0x00007f7ee06138ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #7  0x00007f7ee460a1c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
> #8  0x0000000000400a43 in main ()
>
> The stacks look exactly the same on each machine. Any thoughts or
> ideas would be greatly appreciated, as I am stuck.
>
> Clay Kirkland
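Both stacks show the ranks stuck polling inside the two-process tuned barrier, which on multi-homed hosts often means the TCP BTL has picked an interface the peer cannot reach. A hedged checklist for this class of hang, building on the firewall hint above; the interface name eth0 is an assumption, so substitute one that is routable between both hosts:

    # CentOS 6: confirm no firewall rules block the dynamic TCP ports
    # Open MPI opens between ranks (run on every host)
    service iptables status
    service iptables stop

    # pin the TCP BTL to a single interface both hosts can reach
    /usr/local/bin/mpirun -np 2 --host centos,RAID \
        --mca btl tcp,sm,self \
        --mca btl_tcp_if_include eth0 a.out

If the barrier completes with the interface pinned, the hang was interface selection rather than the installation itself.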
> ------------------------------
>
> Message: 2
> Date: Sat, 3 May 2014 06:39:20 -0700
> From: Ralph Castain <r...@open-mpi.org>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but only
>          when multiple hosts used.
>
> Hmmm... just testing on my little cluster here on two nodes, it works
> just fine with 1.8.2:
>
> [rhc@bend001 v1.8]$ mpirun -n 2 --map-by node ./a.out
>  In rank 0 and host= bend001 Do Barrier call 1.
>  In rank 0 and host= bend001 Do Barrier call 2.
>  In rank 0 and host= bend001 Do Barrier call 3.
>  In rank 1 and host= bend002 Do Barrier call 1.
>  In rank 1 and host= bend002 Do Barrier call 2.
>  In rank 1 and host= bend002 Do Barrier call 3.
> [rhc@bend001 v1.8]$
>
> How are you configuring OMPI?
>
> On May 2, 2014, at 2:24 PM, Clay Kirkland <clay.kirkl...@versityinc.com>
> wrote:
>
> > [full text of Message 1, quoted verbatim; trimmed here]
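A direct way to answer Ralph's question from an existing install is ompi_info, which ships alongside mpirun and reports how the library was built. A minimal sketch, assuming /usr/local/bin is the install actually used at run time; the exact label of the configure line varies between releases, so the grep pattern is a guess:

    # run on both hosts and compare: versions, prefixes, and
    # configure options should be identical everywhere
    /usr/local/bin/ompi_info | head -n 20
    /usr/local/bin/ompi_info --all | grep -i configure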