I am configuring with all defaults: just a ./configure, then make and
make install.  I have used Open MPI on several kinds of Unix systems
this way and have had no trouble before.  I believe I last had success
on a Red Hat version of Linux.
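
For what it's worth, the whole build is just the stock sequence below
(a sketch: the tarball name is the 1.6.5 one mentioned later in the
thread, and the prefix is the default /usr/local; nothing custom):

    tar xzf openmpi-1.6.5.tar.gz
    cd openmpi-1.6.5

    ./configure          # all defaults, installs under /usr/local
    make
    make install         # as root, since the prefix is /usr/local

The same sequence was used on both hosts.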


On Sat, May 3, 2014 at 11:00 AM, <users-requ...@open-mpi.org> wrote:

>
> Today's Topics:
>
>    1. MPI_Barrier hangs on second attempt but only when multiple
>       hosts used. (Clay Kirkland)
>    2. Re: MPI_Barrier hangs on second attempt but only when
>       multiple hosts used. (Ralph Castain)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 2 May 2014 16:24:04 -0500
> From: Clay Kirkland <clay.kirkl...@versityinc.com>
> To: us...@open-mpi.org
> Subject: [OMPI users] MPI_Barrier hangs on second attempt but only
>         when    multiple hosts used.
> Message-ID:
>         <CAJDnjA8Wi=FEjz6Vz+Bc34b+nFE=
> tf4b7g0bqgmbekg7h-p...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> I have been using MPI for many, many years, so I have very well debugged
> MPI tests.  I am having trouble with both openmpi-1.4.5 and openmpi-1.6.5,
> though, in getting the MPI_Barrier calls to work.  It works fine when I run
> all processes on one machine, but when I run with two or more hosts the
> second call to MPI_Barrier always hangs.  Not the first one, but always the
> second one.  I looked at the FAQs and such but found nothing, except for a
> comment that MPI_Barrier problems are often caused by firewalls; not having
> the same version of MPI on both machines was also mentioned as a problem.
> I turned the firewalls off and removed and reinstalled the same version on
> both hosts, but I still see the same thing.  I then installed LAM/MPI on
> two of my machines and that works fine.  I can call MPI_Barrier many times
> with no hangs when running on either machine by itself; it only hangs when
> two or more hosts are involved.
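>
> For reference, the firewall and version checks amount to roughly the
> following on each host (a sketch assuming CentOS 6's iptables firewall
> and the Open MPI install under /usr/local; hostnames are the ones used
> below):
>
>     # turn the firewall off now and on future boots
>     service iptables stop
>     chkconfig iptables off
>
>     # confirm both hosts pick up the same Open MPI install and version
>     which mpirun
>     mpirun --version
>     ompi_info | grep "Open MPI:"
>
>     # repeat from the other host, e.g.
>     ssh RAID 'which mpirun; mpirun --version'
>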
> These runs are all being done on CentOS release 6.4.  Here is the test
> program I used:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>     /* gethostname() */
> #include <mpi.h>
>
> int main (int argc, char **argv)
> {
>     char hoster[256];
>     int myrank, np;
>
>     MPI_Init( &argc, &argv );
>     MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
>     MPI_Comm_size( MPI_COMM_WORLD, &np );
>
>     gethostname(hoster, 256);
>
>     /* Three back-to-back barriers; the hang shows up on the second one. */
>     printf(" In rank %d and host= %s  Do Barrier call 1.\n", myrank, hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     printf(" In rank %d and host= %s  Do Barrier call 2.\n", myrank, hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     printf(" In rank %d and host= %s  Do Barrier call 3.\n", myrank, hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>
>     MPI_Finalize();
>     exit(0);
> }
>
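> A sketch of how the test gets built and launched with the wrappers from
> this install (the source file name barrier_test.c is only an assumption;
> the binary is left as the default a.out to match the runs below):
>
>     /usr/local/bin/mpicc barrier_test.c
>     /usr/local/bin/mpirun -np 2 --host centos,RAID a.out
>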
>   Here are three runs of the test program: first with two processes on one
> host, then with two processes on another host, and finally with one process
> on each of two hosts.  The first two runs are fine, but the last run hangs
> on the second MPI_Barrier.
>
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos a.out
>  In rank 0 and host= centos  Do Barrier call 1.
>  In rank 1 and host= centos  Do Barrier call 1.
>  In rank 1 and host= centos  Do Barrier call 2.
>  In rank 1 and host= centos  Do Barrier call 3.
>  In rank 0 and host= centos  Do Barrier call 2.
>  In rank 0 and host= centos  Do Barrier call 3.
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out
> /root/.bashrc: line 14: unalias: ls: not found
>  In rank 0 and host= RAID  Do Barrier call 1.
>  In rank 0 and host= RAID  Do Barrier call 2.
>  In rank 0 and host= RAID  Do Barrier call 3.
>  In rank 1 and host= RAID  Do Barrier call 1.
>  In rank 1 and host= RAID  Do Barrier call 2.
>  In rank 1 and host= RAID  Do Barrier call 3.
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos,RAID a.out
> /root/.bashrc: line 14: unalias: ls: not found
>  In rank 0 and host= centos  Do Barrier call 1.
>  In rank 0 and host= centos  Do Barrier call 2.
> In rank 1 and host= RAID  Do Barrier call 1.
>  In rank 1 and host= RAID  Do Barrier call 2.
>
>   Since this is such a simple test of such a widely used MPI function, it
> must be an installation or configuration problem.  A pstack of each of the
> hung MPI_Barrier processes on the two machines shows this:
>
> [root@centos ~]# pstack 31666
> #0  0x0000003baf0e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00007f5de06125eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
> #2  0x00007f5de061475a in opal_event_base_loop () from
> /usr/local/lib/libmpi.so.1
> #3  0x00007f5de0639229 in opal_progress () from /usr/local/lib/libmpi.so.1
> #4  0x00007f5de0586f75 in ompi_request_default_wait_all () from
> /usr/local/lib/libmpi.so.1
> #5  0x00007f5ddc59565e in ompi_coll_tuned_sendrecv_actual () from
> /usr/local/lib/openmpi/mca_coll_tuned.so
> #6  0x00007f5ddc59d8ff in ompi_coll_tuned_barrier_intra_two_procs () from
> /usr/local/lib/openmpi/mca_coll_tuned.so
> #7  0x00007f5de05941c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
> #8  0x0000000000400a43 in main ()
>
> [root@RAID openmpi-1.6.5]# pstack 22167
> #0  0x00000030302e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00007f7ee46885eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
> #2  0x00007f7ee468a75a in opal_event_base_loop () from
> /usr/local/lib/libmpi.so.1
> #3  0x00007f7ee46af229 in opal_progress () from /usr/local/lib/libmpi.so.1
> #4  0x00007f7ee45fcf75 in ompi_request_default_wait_all () from
> /usr/local/lib/libmpi.so.1
> #5  0x00007f7ee060b65e in ompi_coll_tuned_sendrecv_actual () from
> /usr/local/lib/openmpi/mca_coll_tuned.so
> #6  0x00007f7ee06138ff in ompi_coll_tuned_barrier_intra_two_procs () from
> /usr/local/lib/openmpi/mca_coll_tuned.so
> #7  0x00007f7ee460a1c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
> #8  0x0000000000400a43 in main ()
>
>  The stacks look exactly the same on each machine.  Any thoughts or ideas
> would be greatly appreciated, as I am stuck.
>
>  Clay Kirkland
>
> ------------------------------
>
> Message: 2
> Date: Sat, 3 May 2014 06:39:20 -0700
> From: Ralph Castain <r...@open-mpi.org>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but only
>         when    multiple hosts used.
> Message-ID: <3cf53d73-15d9-40bb-a2de-50ba3561a...@open-mpi.org>
> Content-Type: text/plain; charset="us-ascii"
>
> Hmmm...just testing on my little cluster here on two nodes, it works just
> fine with 1.8.2:
>
> [rhc@bend001 v1.8]$ mpirun -n 2 --map-by node ./a.out
>  In rank 0 and host= bend001  Do Barrier call 1.
>  In rank 0 and host= bend001  Do Barrier call 2.
>  In rank 0 and host= bend001  Do Barrier call 3.
>  In rank 1 and host= bend002  Do Barrier call 1.
>  In rank 1 and host= bend002  Do Barrier call 2.
>  In rank 1 and host= bend002  Do Barrier call 3.
> [rhc@bend001 v1.8]$
>
>
> How are you configuring OMPI?
>
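> If it is not handy, the exact configure invocation is recorded near the
> top of config.log in the build tree, so something like this will show it
> (the build directory path here is just an assumption):
>
>     cd /root/openmpi-1.6.5
>     head -n 10 config.log      # the "Invocation command line" is listed here
>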
> ------------------------------
>
> End of users Digest, Vol 2879, Issue 1
> **************************************
>
