That last trick seems to work. I can get it to work once in a while with those TCP options, but it is tricky because I have three machines: two of them use eth0 as the primary network interface and one uses eth1. By fiddling with the network options, and perhaps moving a cable or two, I think I can get it all to work. Thanks much for the tip.
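For reference, this is roughly the invocation I am converging on. The third hostname is a placeholder (only centos and RAID are named below), the interface list is just what my boxes happen to use, and I am assuming Open MPI simply skips an interface name that does not exist on a given host:

  /usr/local/bin/mpirun --mca btl_tcp_if_include eth0,eth1 \
      -np 3 --host centos,RAID,<third-host> a.out

  # or select by network rather than by interface name (the subnet is illustrative);
  # I believe btl_tcp_if_include also accepts CIDR notation
  /usr/local/bin/mpirun --mca btl_tcp_if_include 192.168.1.0/24 \
      -np 3 --host centos,RAID,<third-host> a.out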
Clay

On Tue, May 6, 2014 at 11:00 AM, <users-requ...@open-mpi.org> wrote:

> Today's Topics:
>
>    1. Re: MPI_Barrier hangs on second attempt but only when
>       multiple hosts used. (Daniels, Marcus G)
>    2. ROMIO bug reading darrays (Richard Shaw)
>    3. MPI File Open does not work (Imran Ali)
>    4. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
>    5. Re: MPI File Open does not work (Imran Ali)
>    6. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
>    7. Re: MPI File Open does not work (Imran Ali)
>    8. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
>    9. Re: users Digest, Vol 2879, Issue 1 (Jeff Squyres (jsquyres))
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 5 May 2014 19:28:07 +0000
> From: "Daniels, Marcus G" <mdani...@lanl.gov>
> To: "'us...@open-mpi.org'" <us...@open-mpi.org>
> Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but only
>         when multiple hosts used.
> Message-ID:
>         <532c594b7920a549a2a91cb4312cc57640dc5...@ecs-exg-p-mb01.win.lanl.gov>
> Content-Type: text/plain; charset="utf-8"
>
> From: Clay Kirkland [mailto:clay.kirkl...@versityinc.com]
> Sent: Friday, May 02, 2014 03:24 PM
> To: us...@open-mpi.org <us...@open-mpi.org>
> Subject: [OMPI users] MPI_Barrier hangs on second attempt but only when
> multiple hosts used.
>
> I have been using MPI for many, many years, so I have very well debugged MPI
> tests. I am having trouble with both the openmpi-1.4.5 and openmpi-1.6.5
> versions, though, in getting the MPI_Barrier calls to work. It works fine
> when I run all processes on one machine, but when I run with two or more
> hosts the second call to MPI_Barrier always hangs. Not the first one, but
> always the second one. I looked at the FAQs and such but found nothing
> except for a comment that MPI_Barrier problems were often problems with
> firewalls. Also mentioned as a problem was not having the same version of
> MPI on both machines. I turned firewalls off and removed and reinstalled
> the same version on both hosts, but I still see the same thing. I then
> installed LAM MPI on two of my machines and that works fine. I can call the
> MPI_Barrier function many times with no hangs when running on either of the
> two machines by itself; it only hangs when two or more hosts are involved.
> These runs are all being done on CentOS release 6.4. Here is the test
> program I used.
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <mpi.h>
>
> int main (int argc, char **argv)
> {
>     char message[20];
>     char hoster[256];
>     char nameis[256];
>     int fd, i, j, jnp, iret, myrank, np, ranker, recker;
>     MPI_Comm comm;
>     MPI_Status status;
>
>     MPI_Init( &argc, &argv );
>     MPI_Comm_rank( MPI_COMM_WORLD, &myrank);
>     MPI_Comm_size( MPI_COMM_WORLD, &np);
>
>     gethostname(hoster,256);
>
>     printf(" In rank %d and host= %s Do Barrier call 1.\n",myrank,hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     printf(" In rank %d and host= %s Do Barrier call 2.\n",myrank,hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     printf(" In rank %d and host= %s Do Barrier call 3.\n",myrank,hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     exit(0);
> }
>
> Here are three runs of the test program: first with two processes on one
> host, then with two processes on another host, and finally with one process
> on each of two hosts. The first two runs are fine, but the last run hangs on
> the second MPI_Barrier.
>
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos a.out
>  In rank 0 and host= centos Do Barrier call 1.
>  In rank 1 and host= centos Do Barrier call 1.
>  In rank 1 and host= centos Do Barrier call 2.
>  In rank 1 and host= centos Do Barrier call 3.
>  In rank 0 and host= centos Do Barrier call 2.
>  In rank 0 and host= centos Do Barrier call 3.
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out
> /root/.bashrc: line 14: unalias: ls: not found
>  In rank 0 and host= RAID Do Barrier call 1.
>  In rank 0 and host= RAID Do Barrier call 2.
>  In rank 0 and host= RAID Do Barrier call 3.
>  In rank 1 and host= RAID Do Barrier call 1.
>  In rank 1 and host= RAID Do Barrier call 2.
>  In rank 1 and host= RAID Do Barrier call 3.
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos,RAID a.out
> /root/.bashrc: line 14: unalias: ls: not found
>  In rank 0 and host= centos Do Barrier call 1.
>  In rank 0 and host= centos Do Barrier call 2.
>  In rank 1 and host= RAID Do Barrier call 1.
>  In rank 1 and host= RAID Do Barrier call 2.
>
> Since it is such a simple test and problem, and such a widely used MPI
> function, it must obviously be an installation or configuration problem.
> A pstack for each of the hung MPI_Barrier processes on the two machines
> shows this:
>
> [root@centos ~]# pstack 31666
> #0  0x0000003baf0e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00007f5de06125eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
> #2  0x00007f5de061475a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
> #3  0x00007f5de0639229 in opal_progress () from /usr/local/lib/libmpi.so.1
> #4  0x00007f5de0586f75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
> #5  0x00007f5ddc59565e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #6  0x00007f5ddc59d8ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #7  0x00007f5de05941c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
> #8  0x0000000000400a43 in main ()
>
> [root@RAID openmpi-1.6.5]# pstack 22167
> #0  0x00000030302e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00007f7ee46885eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
> #2  0x00007f7ee468a75a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
> #3  0x00007f7ee46af229 in opal_progress () from /usr/local/lib/libmpi.so.1
> #4  0x00007f7ee45fcf75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
> #5  0x00007f7ee060b65e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #6  0x00007f7ee06138ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #7  0x00007f7ee460a1c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
> #8  0x0000000000400a43 in main ()
>
> Which looks exactly the same on each machine. Any thoughts or ideas would be
> greatly appreciated, as I am stuck.
>
> Clay Kirkland
>
> ------------------------------
>
> Message: 2
> Date: Mon, 5 May 2014 22:20:59 -0400
> From: Richard Shaw <jr...@cita.utoronto.ca>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: [OMPI users] ROMIO bug reading darrays
> Message-ID:
>         <can+evmkc+9kacnpausscziufwdj3jfcsymb-8zdx1etdkab...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hello,
>
> I think I've come across a bug when using ROMIO to read in a 2D distributed
> array. I've attached a test case to this email.
>
> In the test case I first initialise an array of 25 doubles (which will be a
> 5x5 grid), then I create a datatype representing a 5x5 matrix distributed
> in 3x3 blocks over a 2x2 process grid. As a reference I use MPI_Pack to
> pull out the block-cyclic array elements local to each process (which I
> think is correct). Then I write the original array of 25 doubles to disk,
> use MPI-IO to read it back in (performing the Open, Set_view, and
> Read_all), and compare to the reference.
>
> Running this with OMPIO, the two match on all ranks.
> > mpirun -mca io ompio -np 4 ./darr_read.x
> === Rank 0 === (9 elements)
> Packed: 0.0 1.0 2.0 5.0 6.0 7.0 10.0 11.0 12.0
> Read: 0.0 1.0 2.0 5.0 6.0 7.0 10.0 11.0 12.0
>
> === Rank 1 === (6 elements)
> Packed: 15.0 16.0 17.0 20.0 21.0 22.0
> Read: 15.0 16.0 17.0 20.0 21.0 22.0
>
> === Rank 2 === (6 elements)
> Packed: 3.0 4.0 8.0 9.0 13.0 14.0
> Read: 3.0 4.0 8.0 9.0 13.0 14.0
>
> === Rank 3 === (4 elements)
> Packed: 18.0 19.0 23.0 24.0
> Read: 18.0 19.0 23.0 24.0
>
> However, using ROMIO the two differ on two of the ranks:
>
> > mpirun -mca io romio -np 4 ./darr_read.x
> === Rank 0 === (9 elements)
> Packed: 0.0 1.0 2.0 5.0 6.0 7.0 10.0 11.0 12.0
> Read: 0.0 1.0 2.0 5.0 6.0 7.0 10.0 11.0 12.0
>
> === Rank 1 === (6 elements)
> Packed: 15.0 16.0 17.0 20.0 21.0 22.0
> Read: 0.0 1.0 2.0 0.0 1.0 2.0
>
> === Rank 2 === (6 elements)
> Packed: 3.0 4.0 8.0 9.0 13.0 14.0
> Read: 3.0 4.0 8.0 9.0 13.0 14.0
>
> === Rank 3 === (4 elements)
> Packed: 18.0 19.0 23.0 24.0
> Read: 0.0 1.0 0.0 1.0
>
> My interpretation is that the behaviour with OMPIO is correct. Interestingly,
> everything matches up using both ROMIO and OMPIO if I set the block shape
> to 2x2.
>
> This was run on OS X using 1.8.2a1r31632. I have also run this on Linux with
> OpenMPI 1.7.4, and OMPIO is still correct, but using ROMIO I just get
> segfaults.
>
> Thanks,
> Richard
>
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: darr_read.c
> Type: text/x-csrc
> Size: 2218 bytes
> Desc: not available
> URL: <http://www.open-mpi.org/MailArchives/users/attachments/20140505/5a5ab0ba/attachment.bin>
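> For reference, a rough sketch of the read-back path described above (a 5x5
> global array of doubles in 3x3 blocks over a 2x2 process grid) might look
> like the following. This is only an illustration -- the file name, the
> assumption of exactly 4 ranks, the C ordering, and the oversized read count
> are placeholders; the attached darr_read.c is the actual test case:
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank, count;
>     int gsizes[2]   = {5, 5};
>     int distribs[2] = {MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC};
>     int dargs[2]    = {3, 3};          /* 3x3 blocks */
>     int psizes[2]   = {2, 2};          /* 2x2 process grid */
>     double local[25];                  /* big enough for any rank's share */
>     MPI_Datatype darray;
>     MPI_File fh;
>     MPI_Status status;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     /* Datatype selecting this rank's pieces of the 5x5 block-cyclic array */
>     MPI_Type_create_darray(4, rank, 2, gsizes, distribs, dargs, psizes,
>                            MPI_ORDER_C, MPI_DOUBLE, &darray);
>     MPI_Type_commit(&darray);
>
>     /* Read the flat file of 25 doubles back through the darray file view */
>     MPI_File_open(MPI_COMM_WORLD, "darr.dat", MPI_MODE_RDONLY,
>                   MPI_INFO_NULL, &fh);
>     MPI_File_set_view(fh, 0, MPI_DOUBLE, darray, "native", MPI_INFO_NULL);
>     MPI_File_read_all(fh, local, 25, MPI_DOUBLE, &status);
>     MPI_File_close(&fh);
>
>     /* The view exposes only this rank's elements; status reports how many */
>     MPI_Get_count(&status, MPI_DOUBLE, &count);
>     printf("rank %d read %d elements\n", rank, count);
>
>     MPI_Type_free(&darray);
>     MPI_Finalize();
>     return 0;
> }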
> ------------------------------
>
> Message: 3
> Date: Tue, 06 May 2014 13:24:35 +0200
> From: Imran Ali <imra...@student.matnat.uio.no>
> To: <us...@open-mpi.org>
> Subject: [OMPI users] MPI File Open does not work
> Message-ID: <d57bdf499c00360b737205b085c50...@ulrik.uio.no>
> Content-Type: text/plain; charset="utf-8"
>
> I get the following error when I try to run the following Python code:
>
> import mpi4py.MPI as MPI
> comm = MPI.COMM_WORLD
> MPI.File.Open(comm,"some.file")
>
> $ mpirun -np 1 python test_mpi.py
> Traceback (most recent call last):
>   File "test_mpi.py", line 3, in <module>
>     MPI.File.Open(comm," h5ex_d_alloc.h5")
>   File "File.pyx", line 67, in mpi4py.MPI.File.Open (src/mpi4py.MPI.c:89639)
> mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
>
> My mpirun version is (Open MPI) 1.6.2. I installed Open MPI using the dorsal
> script (https://github.com/FEniCS/dorsal) for Red Hat Enterprise Linux 6
> (the OS I am using, release 6.5). It configured the build as follows:
>
> ./configure --enable-mpi-thread-multiple --enable-opal-multi-threads
> --with-threads=posix --disable-mpi-profile
>
> I need to emphasize that I do not have root access on the system I am
> running my application on.
>
> Imran
>
> ------------------------------
>
> Message: 4
> Date: Tue, 6 May 2014 12:56:04 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] MPI File Open does not work
> Message-ID: <e7df28cb-d4fb-4087-928e-18e61d1d2...@cisco.com>
> Content-Type: text/plain; charset="us-ascii"
>
> The thread support in the 1.6 series is not very good. You might try:
>
> - Upgrading to 1.6.5
> - Or better yet, upgrading to 1.8.1
>
> On May 6, 2014, at 7:24 AM, Imran Ali <imra...@student.matnat.uio.no> wrote:
>
> > [Imran's message quoted in full -- snipped; see Message 3 above.]
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ------------------------------
>
> Message: 5
> Date: Tue, 6 May 2014 15:32:21 +0200
> From: Imran Ali <imra...@student.matnat.uio.no>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] MPI File Open does not work
> Message-ID: <fa6dffff-6c66-4a47-84fc-148fb51ce...@math.uio.no>
> Content-Type: text/plain; charset=us-ascii
>
> On 6 May 2014, at 14:56, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> > The thread support in the 1.6 series is not very good. You might try:
> >
> > - Upgrading to 1.6.5
> > - Or better yet, upgrading to 1.8.1
>
> I will attempt that, then. I read at
>
> http://www.open-mpi.org/faq/?category=building#install-overwrite
>
> that I should completely uninstall my previous version. Could you recommend
> how I can go about doing that (without root access)? I am uncertain where
> I can use make uninstall.
> Imran
>
> > [earlier quoted messages and list footer snipped]
>
> ------------------------------
>
> Message: 6
> Date: Tue, 6 May 2014 13:34:38 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] MPI File Open does not work
> Message-ID: <2a933c0e-80f6-4ded-b44c-53b5f37ef...@cisco.com>
> Content-Type: text/plain; charset="us-ascii"
>
> On May 6, 2014, at 9:32 AM, Imran Ali <imra...@student.matnat.uio.no> wrote:
>
> > I will attempt that, then. I read at
> >
> > http://www.open-mpi.org/faq/?category=building#install-overwrite
> >
> > that I should completely uninstall my previous version.
>
> Yes, that is best. OR: you can install into a whole separate tree and ignore
> the first installation.
>
> > Could you recommend how I can go about doing that (without root access)?
> > I am uncertain where I can use make uninstall.
>
> If you don't have write access into the installation tree (i.e., it was
> installed via root and you don't have root access), then your best bet is
> simply to install into a new tree. E.g., if OMPI is installed into
> /opt/openmpi-1.6.2, try installing into /opt/openmpi-1.6.5, or even
> $HOME/installs/openmpi-1.6.5, or something like that.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ------------------------------
>
> Message: 7
> Date: Tue, 6 May 2014 15:40:34 +0200
> From: Imran Ali <imra...@student.matnat.uio.no>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] MPI File Open does not work
> Message-ID: <14f0596c-c5c5-4b1a-a4a8-8849d44ab...@math.uio.no>
> Content-Type: text/plain; charset=us-ascii
>
> On 6 May 2014, at 15:34, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> > [earlier exchange quoted in full -- snipped; see Messages 5 and 6 above.]
>
> My install was in my user directory (i.e., $HOME). I managed to locate the
> source directory and successfully run make uninstall.
>
> Will let you know how things went after installation.
>
> Imran
>
> ------------------------------
>
> Message: 8
> Date: Tue, 6 May 2014 14:42:52 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] MPI File Open does not work
> Message-ID: <710e3328-edaa-4a13-9f07-b45fe3191...@cisco.com>
> Content-Type: text/plain; charset="us-ascii"
>
> On May 6, 2014, at 9:40 AM, Imran Ali <imra...@student.matnat.uio.no> wrote:
>
> > My install was in my user directory (i.e., $HOME). I managed to locate the
> > source directory and successfully run make uninstall.
>
> FWIW, I usually install Open MPI into its own subdir. E.g.,
> $HOME/installs/openmpi-x.y.z. Then if I don't want that install any more,
> I can just "rm -rf $HOME/installs/openmpi-x.y.z" -- no need to "make
> uninstall". Specifically: if there's nothing else installed in the same
> tree as Open MPI, you can just rm -rf its installation tree.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
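> A minimal sketch of that separate-tree approach (the paths are illustrative;
> add whatever configure flags you were already using):
>
>   cd openmpi-1.6.5
>   ./configure --prefix=$HOME/installs/openmpi-1.6.5
>   make all install
>
>   # point your environment at the new tree
>   export PATH=$HOME/installs/openmpi-1.6.5/bin:$PATH
>   export LD_LIBRARY_PATH=$HOME/installs/openmpi-1.6.5/lib:$LD_LIBRARY_PATH
>
>   # removing that install later is then just
>   rm -rf $HOME/installs/openmpi-1.6.5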
> ------------------------------
>
> Message: 9
> Date: Tue, 6 May 2014 14:50:34 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] users Digest, Vol 2879, Issue 1
> Message-ID: <c60aa7e1-96a7-4ccd-9b5b-11a38fb87...@cisco.com>
> Content-Type: text/plain; charset="us-ascii"
>
> Are you using TCP as the MPI transport?
>
> If so, another thing to try is to limit the IP interfaces that MPI uses for
> its traffic, to see if there's some kind of problem with specific networks.
>
> For example:
>
> mpirun --mca btl_tcp_if_include eth0 ...
>
> If that works, then try adding in any/all other IP interfaces that you have
> on your machines.
>
> A sorta-wild guess: you have some IP interfaces that aren't working, or at
> least don't work in the way that OMPI wants them to work. So the first
> barrier works because it flows across eth0 (or some other first network
> that, as far as OMPI is concerned, works just fine). But then the next
> barrier round-robin advances to the next IP interface, and it doesn't work
> for some reason.
>
> We've seen virtual machine bridge interfaces cause problems, for example.
> E.g., if a machine has a Xen virtual machine interface (virbr0, IIRC?), then
> OMPI will try to use it to communicate with peer MPI processes, because it
> has a "compatible" IP address and OMPI thinks it should be
> connected/reachable to peers. If this is the case, you might want to disable
> such interfaces and/or use btl_tcp_if_include or btl_tcp_if_exclude to
> select the interfaces that you want to use.
>
> Pro tip: if you use btl_tcp_if_exclude, remember to exclude the loopback
> interface, too. OMPI defaults to a btl_tcp_if_include="" (blank) and
> btl_tcp_if_exclude="lo0". So if you override btl_tcp_if_exclude, you need to
> be sure to *also* include lo0 in the new value. For example:
>
> mpirun --mca btl_tcp_if_exclude lo0,virbr0 ...
>
> Also, if possible, try upgrading to Open MPI 1.8.1.
>
> On May 4, 2014, at 2:15 PM, Clay Kirkland <clay.kirkl...@versityinc.com> wrote:
>
> > I am configuring with all defaults. Just doing a ./configure and then
> > make and make install. I have used Open MPI on several kinds of
> > Unix systems this way and have had no trouble before. I believe I
> > last had success on a Red Hat version of Linux.
> >
> > On Sat, May 3, 2014 at 11:00 AM, <users-requ...@open-mpi.org> wrote:
> >
> > [The quoted users Digest, Vol 2879, Issue 1 header and Clay's original
> > message (Message 1 of that digest) are snipped here; the same text
> > appears in full under Message 1 above.]
> > ------------------------------
> >
> > Message: 2
> > Date: Sat, 3 May 2014 06:39:20 -0700
> > From: Ralph Castain <r...@open-mpi.org>
> > To: Open MPI Users <us...@open-mpi.org>
> > Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but only
> >         when multiple hosts used.
> > Message-ID: <3cf53d73-15d9-40bb-a2de-50ba3561a...@open-mpi.org>
> > Content-Type: text/plain; charset="us-ascii"
> >
> > Hmmm... just testing on my little cluster here on two nodes, it works
> > just fine with 1.8.2:
> >
> > [rhc@bend001 v1.8]$ mpirun -n 2 --map-by node ./a.out
> >  In rank 0 and host= bend001 Do Barrier call 1.
> >  In rank 0 and host= bend001 Do Barrier call 2.
> >  In rank 0 and host= bend001 Do Barrier call 3.
> >  In rank 1 and host= bend002 Do Barrier call 1.
> >  In rank 1 and host= bend002 Do Barrier call 2.
> >  In rank 1 and host= bend002 Do Barrier call 3.
> > [rhc@bend001 v1.8]$
> >
> > How are you configuring OMPI?
> >
> > On May 2, 2014, at 2:24 PM, Clay Kirkland <clay.kirkl...@versityinc.com> wrote:
> >
> > > [Clay's original message quoted in full -- snipped; see Message 1 above.]
> >
> > ------------------------------
> >
> > End of users Digest, Vol 2879, Issue 1
> > **************************************
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ------------------------------
>
> End of users Digest, Vol 2881, Issue 1
> **************************************