Got it to work, finally. The longer command line doesn't work, but if I take off the -mca oob_tcp_if_include 192.168.0.0/16 part then everything works from every combination of machines I have.
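For the record, a sketch of the two invocations, using the centos/RAID hosts and the a.out barrier test from the runs quoted below (the exact command lines are illustrative):

    # works: restrict only the MPI (BTL) traffic to the 192.168.0.x subnet
    /usr/local/bin/mpirun -np 2 --host centos,RAID \
        -mca btl_tcp_if_include 192.168.0.0/16 a.out

    # does not work here: additionally restricting the out-of-band channel
    /usr/local/bin/mpirun -np 2 --host centos,RAID \
        -mca btl_tcp_if_include 192.168.0.0/16 \
        -mca oob_tcp_if_include 192.168.0.0/16 a.out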
And as to "any MPI is going to have trouble": in my original posting I stated that I installed LAM MPI on the same hardware and it worked just fine. Maybe you guys should look at what they do and copy it. Virtually every machine I have used in the last 5 years has multiple NIC interfaces, and almost all of them are set up to use only one interface. It seems odd to have a product that is designed to lash together multiple machines fail with a default install on generic machines. But software is like that sometimes, and I want to thank you very much for all the help. Please take my criticism with a grain of salt. I love MPI, I just want to see it work. I have been using it for some 20 years to synchronize multiple machines for I/O testing, and it is one slick product for that. It has helped us find many bugs in shared file systems.

Thanks again,

On Tue, May 6, 2014 at 7:45 PM, <users-requ...@open-mpi.org> wrote:

> Message: 1
> Date: Tue, 6 May 2014 17:45:09 -0700
> From: Ralph Castain <r...@open-mpi.org>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] users Digest, Vol 2881, Issue 2
>
> -mca btl_tcp_if_include 192.168.0.0/16 -mca oob_tcp_if_include 192.168.0.0/16
>
> should do the trick. Any MPI is going to have trouble with your
> arrangement - just need a little hint to help figure it out.
>
> On May 6, 2014, at 5:14 PM, Clay Kirkland <clay.kirkl...@versityinc.com> wrote:
>
> > Someone suggested using some network address if all machines are on the
> > same subnet. They are all on the same subnet, I think. I have no idea
> > what to put for a param there. I tried the ethernet address, but of
> > course it couldn't be that simple.
> > Here are my ifconfig outputs from a couple of machines:
> >
> > [root@RAID MPI]# ifconfig -a
> > eth0      Link encap:Ethernet  HWaddr 00:25:90:73:2A:36
> >           inet addr:192.168.0.59  Bcast:192.168.0.255  Mask:255.255.255.0
> >           inet6 addr: fe80::225:90ff:fe73:2a36/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:17983 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:9952 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:26309771 (25.0 MiB)  TX bytes:758940 (741.1 KiB)
> >           Interrupt:16 Memory:fbde0000-fbe00000
> >
> > eth1      Link encap:Ethernet  HWaddr 00:25:90:73:2A:37
> >           inet6 addr: fe80::225:90ff:fe73:2a37/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:56 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:3924 (3.8 KiB)  TX bytes:468 (468.0 b)
> >           Interrupt:17 Memory:fbee0000-fbf00000
> >
> > And from one that I can't get to work:
> >
> > [root@centos ~]# ifconfig -a
> > eth0      Link encap:Ethernet  HWaddr 00:1E:4F:FB:30:34
> >           inet6 addr: fe80::21e:4fff:fefb:3034/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:45 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:2700 (2.6 KiB)  TX bytes:468 (468.0 b)
> >           Interrupt:21 Memory:fe9e0000-fea00000
> >
> > eth1      Link encap:Ethernet  HWaddr 00:14:D1:22:8E:50
> >           inet addr:192.168.0.154  Bcast:192.168.0.255  Mask:255.255.255.0
> >           inet6 addr: fe80::214:d1ff:fe22:8e50/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:160 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:120 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:31053 (30.3 KiB)  TX bytes:18897 (18.4 KiB)
> >           Interrupt:16 Base address:0x2f00
> >
> > The centos machine is using eth1 and not eth0; therein lies the problem.
> >
> > I don't really need all this optimization of using multiple ethernet
> > adaptors to speed things up. I am just using MPI to synchronize I/O
> > tests. Can I go back to a really old version and avoid all this painful
> > debugging?
> >
> > On Tue, May 6, 2014 at 6:50 PM, <users-requ...@open-mpi.org> wrote:
> > Message: 1
> > Date: Tue, 6 May 2014 18:28:59 -0500
> > From: Clay Kirkland <clay.kirkl...@versityinc.com>
> > Subject: Re: [OMPI users] users Digest, Vol 2881, Issue 1
> >
> > That last trick seems to work. I can get it to work once in a while with
> > those tcp options, but it is tricky, as I have three machines and two of
> > them use eth0 as the primary network interface and one uses eth1. But by
> > fiddling with network options, and perhaps moving a cable or two, I think
> > I can get it all to work. Thanks much for the tip.
> >
> > Clay
> >
> > On Tue, May 6, 2014 at 11:00 AM, <users-requ...@open-mpi.org> wrote:
> >
> > > Message: 1
> > > Date: Mon, 5 May 2014 19:28:07 +0000
> > > From: "Daniels, Marcus G" <mdani...@lanl.gov>
> > > Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but only
> > >          when multiple hosts used.
> > >
> > > From: Clay Kirkland [mailto:clay.kirkl...@versityinc.com]
> > > Sent: Friday, May 02, 2014 03:24 PM
> > > Subject: [OMPI users] MPI_Barrier hangs on second attempt but only
> > >          when multiple hosts used.
> > >
> > > I have been using MPI for many, many years, so I have very well debugged
> > > MPI tests. I am having trouble with the openmpi-1.4.5 and openmpi-1.6.5
> > > versions, though, in getting the MPI_Barrier calls to work. It works fine
> > > when I run all processes on one machine, but when I run with two or more
> > > hosts the second call to MPI_Barrier always hangs. Not the first one,
> > > but always the second one. I looked at FAQs and such but found nothing
> > > except for a comment that MPI_Barrier problems were often problems with
> > > firewalls. Also mentioned as a problem was not having the same version
> > > of MPI on both machines. I turned firewalls off and removed and
> > > reinstalled the same version on both hosts, but I still see the same
> > > thing. I then installed LAM MPI on two of my machines and that works
> > > fine. I can call the MPI_Barrier function many times with no hangs when
> > > run on either of the two machines by itself; it only hangs when two or
> > > more hosts are involved. These runs are all being done on CentOS release
> > > 6.4. Here is the test program I used.
> > > #include <stdio.h>
> > > #include <stdlib.h>
> > > #include <unistd.h>
> > > #include <mpi.h>
> > >
> > > int main(int argc, char **argv)
> > > {
> > >     char hoster[256];
> > >     int  myrank, np;
> > >
> > >     MPI_Init(&argc, &argv);
> > >     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
> > >     MPI_Comm_size(MPI_COMM_WORLD, &np);
> > >
> > >     gethostname(hoster, 256);
> > >
> > >     /* Three barriers in a row; the hang is always on the second one. */
> > >     printf(" In rank %d and host= %s  Do Barrier call 1.\n", myrank, hoster);
> > >     MPI_Barrier(MPI_COMM_WORLD);
> > >     printf(" In rank %d and host= %s  Do Barrier call 2.\n", myrank, hoster);
> > >     MPI_Barrier(MPI_COMM_WORLD);
> > >     printf(" In rank %d and host= %s  Do Barrier call 3.\n", myrank, hoster);
> > >     MPI_Barrier(MPI_COMM_WORLD);
> > >     MPI_Finalize();
> > >     return 0;
> > > }
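> > > For reference, the test can be built and launched with something like
> > > the following (the source file name is just an example):
> > >
> > >     mpicc -o a.out barrier_test.c
> > >     /usr/local/bin/mpirun -np 2 --host centos,RAID a.out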
> > > Here are three runs of the test program: first with two processes on
> > > one host, then with two processes on another host, and finally with one
> > > process on each of two hosts. The first two runs are fine, but the last
> > > run hangs on the second MPI_Barrier.
> > >
> > > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos a.out
> > >  In rank 0 and host= centos  Do Barrier call 1.
> > >  In rank 1 and host= centos  Do Barrier call 1.
> > >  In rank 1 and host= centos  Do Barrier call 2.
> > >  In rank 1 and host= centos  Do Barrier call 3.
> > >  In rank 0 and host= centos  Do Barrier call 2.
> > >  In rank 0 and host= centos  Do Barrier call 3.
> > > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out
> > > /root/.bashrc: line 14: unalias: ls: not found
> > >  In rank 0 and host= RAID  Do Barrier call 1.
> > >  In rank 0 and host= RAID  Do Barrier call 2.
> > >  In rank 0 and host= RAID  Do Barrier call 3.
> > >  In rank 1 and host= RAID  Do Barrier call 1.
> > >  In rank 1 and host= RAID  Do Barrier call 2.
> > >  In rank 1 and host= RAID  Do Barrier call 3.
> > > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos,RAID a.out
> > > /root/.bashrc: line 14: unalias: ls: not found
> > >  In rank 0 and host= centos  Do Barrier call 1.
> > >  In rank 0 and host= centos  Do Barrier call 2.
> > >  In rank 1 and host= RAID  Do Barrier call 1.
> > >  In rank 1 and host= RAID  Do Barrier call 2.
> > >
> > > Since it is such a simple test and problem, and such a widely used MPI
> > > function, it must obviously be an installation or configuration problem.
> > > A pstack for each of the hung MPI_Barrier processes on the two machines
> > > shows this:
> > >
> > > [root@centos ~]# pstack 31666
> > > #0  0x0000003baf0e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> > > #1  0x00007f5de06125eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
> > > #2  0x00007f5de061475a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
> > > #3  0x00007f5de0639229 in opal_progress () from /usr/local/lib/libmpi.so.1
> > > #4  0x00007f5de0586f75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
> > > #5  0x00007f5ddc59565e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
> > > #6  0x00007f5ddc59d8ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
> > > #7  0x00007f5de05941c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
> > > #8  0x0000000000400a43 in main ()
> > >
> > > [root@RAID openmpi-1.6.5]# pstack 22167
> > > #0  0x00000030302e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> > > #1  0x00007f7ee46885eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
> > > #2  0x00007f7ee468a75a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
> > > #3  0x00007f7ee46af229 in opal_progress () from /usr/local/lib/libmpi.so.1
> > > #4  0x00007f7ee45fcf75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
> > > #5  0x00007f7ee060b65e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
> > > #6  0x00007f7ee06138ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
> > > #7  0x00007f7ee460a1c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
> > > #8  0x0000000000400a43 in main ()
> > >
> > > The stacks look exactly the same on each machine. Any thoughts or ideas
> > > would be greatly appreciated, as I am stuck.
> > >
> > > Clay Kirkland
> > >
> > > ------------------------------
> > >
> > > Message: 2
> > > Date: Mon, 5 May 2014 22:20:59 -0400
> > > From: Richard Shaw <jr...@cita.utoronto.ca>
> > > Subject: [OMPI users] ROMIO bug reading darrays
> > >
> > > Hello,
> > >
> > > I think I've come across a bug when using ROMIO to read in a 2D
> > > distributed array. I've attached a test case to this email.
> > >
> > > In the test case I first initialise an array of 25 doubles (which will
> > > be a 5x5 grid), then I create a datatype representing a 5x5 matrix
> > > distributed in 3x3 blocks over a 2x2 process grid. As a reference I use
> > > MPI_Pack to pull out the block-cyclic array elements local to each
> > > process (which I think is correct). Then I write the original array of
> > > 25 doubles to disk, use MPI-IO to read it back in (performing the Open,
> > > Set_view, and Read_all), and compare to the reference.
> > >
> > > Running this with OMPIO, the two match on all ranks:
> > > > mpirun -mca io ompio -np 4 ./darr_read.x
> > > === Rank 0 === (9 elements)
> > > Packed:   0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
> > > Read:     0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
> > >
> > > === Rank 1 === (6 elements)
> > > Packed:  15.0 16.0 17.0 20.0 21.0 22.0
> > > Read:    15.0 16.0 17.0 20.0 21.0 22.0
> > >
> > > === Rank 2 === (6 elements)
> > > Packed:   3.0  4.0  8.0  9.0 13.0 14.0
> > > Read:     3.0  4.0  8.0  9.0 13.0 14.0
> > >
> > > === Rank 3 === (4 elements)
> > > Packed:  18.0 19.0 23.0 24.0
> > > Read:    18.0 19.0 23.0 24.0
> > >
> > > However, using ROMIO the two differ on two of the ranks:
> > >
> > > > mpirun -mca io romio -np 4 ./darr_read.x
> > > === Rank 0 === (9 elements)
> > > Packed:   0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
> > > Read:     0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
> > >
> > > === Rank 1 === (6 elements)
> > > Packed:  15.0 16.0 17.0 20.0 21.0 22.0
> > > Read:     0.0  1.0  2.0  0.0  1.0  2.0
> > >
> > > === Rank 2 === (6 elements)
> > > Packed:   3.0  4.0  8.0  9.0 13.0 14.0
> > > Read:     3.0  4.0  8.0  9.0 13.0 14.0
> > >
> > > === Rank 3 === (4 elements)
> > > Packed:  18.0 19.0 23.0 24.0
> > > Read:     0.0  1.0  0.0  1.0
> > >
> > > My interpretation is that the behaviour with OMPIO is correct.
> > > Interestingly, everything matches up using both ROMIO and OMPIO if I
> > > set the block shape to 2x2.
> > >
> > > This was run on OS X using 1.8.2a1r31632. I have also run this on Linux
> > > with Open MPI 1.7.4, and OMPIO is still correct, but using ROMIO I just
> > > get segfaults.
> > >
> > > Thanks,
> > > Richard
> > >
> > > (Attachment: darr_read.c, 2218 bytes,
> > > http://www.open-mpi.org/MailArchives/users/attachments/20140505/5a5ab0ba/attachment.bin)
> > >
> > > ------------------------------
> > >
> > > Message: 3
> > > Date: Tue, 06 May 2014 13:24:35 +0200
> > > From: Imran Ali <imra...@student.matnat.uio.no>
> > > Subject: [OMPI users] MPI File Open does not work
> > >
> > > I get the following error when I try to run the following python code:
> > >
> > >     import mpi4py.MPI as MPI
> > >     comm = MPI.COMM_WORLD
> > >     MPI.File.Open(comm,"some.file")
> > >
> > >     $ mpirun -np 1 python test_mpi.py
> > >     Traceback (most recent call last):
> > >       File "test_mpi.py", line 3, in <module>
> > >         MPI.File.Open(comm," h5ex_d_alloc.h5")
> > >       File "File.pyx", line 67, in mpi4py.MPI.File.Open (src/mpi4py.MPI.c:89639)
> > >     mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
> > >     --------------------------------------------------------------------------
> > >     mpirun noticed that the job aborted, but has no info as to the
> > >     process that caused that situation.
> > >     --------------------------------------------------------------------------
> > >
> > > My mpirun version is (Open MPI) 1.6.2. I installed Open MPI using the
> > > dorsal script (https://github.com/FEniCS/dorsal) for Red Hat Enterprise
> > > 6 (the OS I am using, release 6.5).
> > > It configured the build as follows:
> > >
> > >     ./configure --enable-mpi-thread-multiple --enable-opal-multi-threads
> > >         --with-threads=posix --disable-mpi-profile
> > >
> > > I need to emphasize that I do not have root access on the system I am
> > > running my application on.
> > >
> > > Imran
> > >
> > > ------------------------------
> > >
> > > Message: 4
> > > Date: Tue, 6 May 2014 12:56:04 +0000
> > > From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> > > Subject: Re: [OMPI users] MPI File Open does not work
> > >
> > > The thread support in the 1.6 series is not very good. You might try:
> > >
> > > - Upgrading to 1.6.5
> > > - Or, better yet, upgrading to 1.8.1
> > >
> > > ------------------------------
> > >
> > > Message: 5
> > > Date: Tue, 6 May 2014 15:32:21 +0200
> > > From: Imran Ali <imra...@student.matnat.uio.no>
> > > Subject: Re: [OMPI users] MPI File Open does not work
> > >
> > > I will attempt that, then.
> > > I read at
> > >
> > >     http://www.open-mpi.org/faq/?category=building#install-overwrite
> > >
> > > that I should completely uninstall my previous version. Could you
> > > recommend how I can go about doing that (without root access)? I am
> > > uncertain where I can run make uninstall.
> > >
> > > Imran
> > >
> > > ------------------------------
> > >
> > > Message: 6
> > > Date: Tue, 6 May 2014 13:34:38 +0000
> > > From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> > > Subject: Re: [OMPI users] MPI File Open does not work
> > >
> > > > that I should completely uninstall my previous version.
> > >
> > > Yes, that is best. OR: you can install into a whole separate tree and
> > > ignore the first installation.
> > >
> > > > Could you recommend how I can go about doing that (without root
> > > > access)? I am uncertain where I can run make uninstall.
> > >
> > > If you don't have write access to the installation tree (i.e., it was
> > > installed via root and you don't have root access), then your best bet
> > > is simply to install into a new tree.
> > > E.g., if OMPI is installed into /opt/openmpi-1.6.2, try installing into
> > > /opt/openmpi-1.6.5, or even $HOME/installs/openmpi-1.6.5, or something
> > > like that.
> > >
> > > ------------------------------
> > >
> > > Message: 7
> > > Date: Tue, 6 May 2014 15:40:34 +0200
> > > From: Imran Ali <imra...@student.matnat.uio.no>
> > > Subject: Re: [OMPI users] MPI File Open does not work
> > >
> > > My install was in my user directory (i.e., $HOME). I managed to locate
> > > the source directory and successfully ran make uninstall.
> > >
> > > Will let you know how things went after installation.
> > >
> > > Imran
> > >
> > > ------------------------------
> > >
> > > Message: 8
> > > Date: Tue, 6 May 2014 14:42:52 +0000
> > > From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> > > Subject: Re: [OMPI users] MPI File Open does not work
> > >
> > > FWIW, I usually install Open MPI into its own subdir, e.g.,
> > > $HOME/installs/openmpi-x.y.z. Then if I don't want that install any
> > > more, I can just "rm -rf $HOME/installs/openmpi-x.y.z" -- no need to
> > > "make uninstall". Specifically: if there's nothing else installed in
> > > the same tree as Open MPI, you can just rm -rf its installation tree.
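> > > A minimal sketch of that approach (the prefix path is just an example;
> > > adjust to taste):
> > >
> > >     ./configure --prefix=$HOME/installs/openmpi-1.6.5
> > >     make all && make install
> > >     export PATH=$HOME/installs/openmpi-1.6.5/bin:$PATH
> > >     export LD_LIBRARY_PATH=$HOME/installs/openmpi-1.6.5/lib:$LD_LIBRARY_PATH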
> > > ------------------------------
> > >
> > > Message: 9
> > > Date: Tue, 6 May 2014 14:50:34 +0000
> > > From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> > > Subject: Re: [OMPI users] users Digest, Vol 2879, Issue 1
> > >
> > > Are you using TCP as the MPI transport?
> > >
> > > If so, another thing to try is to limit the IP interfaces that MPI uses
> > > for its traffic, to see if there's some kind of problem with specific
> > > networks. For example:
> > >
> > >     mpirun --mca btl_tcp_if_include eth0 ...
> > >
> > > If that works, then try adding in any/all other IP interfaces that you
> > > have on your machines.
> > >
> > > A sorta-wild guess: you have some IP interfaces that aren't working, or
> > > at least don't work in the way that OMPI wants them to work. So the
> > > first barrier works because it flows across eth0 (or some other first
> > > network that, as far as OMPI is concerned, works just fine). But then
> > > the next barrier round-robin advances to the next IP interface, and it
> > > doesn't work for some reason.
> > >
> > > We've seen virtual machine bridge interfaces cause problems, for
> > > example. E.g., if a machine has a Xen virtual machine interface (vibr0,
> > > IIRC?), then OMPI will try to use it to communicate with peer MPI
> > > processes, because it has a "compatible" IP address and OMPI thinks it
> > > should be connected/reachable to peers. If this is the case, you might
> > > want to disable such interfaces and/or use btl_tcp_if_include or
> > > btl_tcp_if_exclude to select the interfaces that you want to use.
> > >
> > > Pro tip: if you use btl_tcp_if_exclude, remember to exclude the
> > > loopback interface, too. OMPI defaults to btl_tcp_if_include="" (blank)
> > > and btl_tcp_if_exclude="lo0". So if you override btl_tcp_if_exclude,
> > > you need to be sure to *also* include lo0 in the new value. For example:
> > >
> > >     mpirun --mca btl_tcp_if_exclude lo0,virb0 ...
> > >
> > > Also, if possible, try upgrading to Open MPI 1.8.1.
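> > > As a quick check, ompi_info (which ships with Open MPI) will list the
> > > TCP BTL's interface-selection parameters and their current values; the
> > > grep is just a convenience:
> > >
> > >     ompi_info --param btl tcp | grep if_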
> > > On May 4, 2014, at 2:15 PM, Clay Kirkland <clay.kirkl...@versityinc.com> wrote:
> > >
> > > > I am configuring with all defaults. Just doing a ./configure and then
> > > > make and make install. I have used Open MPI on several kinds of Unix
> > > > systems this way and have had no trouble before. I believe I last had
> > > > success on a Red Hat version of Linux.
> > > >
> > > > On Sat, May 3, 2014 at 11:00 AM, <users-requ...@open-mpi.org> wrote:
> > > > Message: 2
> > > > Date: Sat, 3 May 2014 06:39:20 -0700
> > > > From: Ralph Castain <r...@open-mpi.org>
> > > > Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but only
> > > >          when multiple hosts used.
> > > >
> > > > Hmmm... just testing on my little cluster here on two nodes, it works
> > > > just fine with 1.8.2:
> > > >
> > > > [rhc@bend001 v1.8]$ mpirun -n 2 --map-by node ./a.out
> > > >  In rank 0 and host= bend001  Do Barrier call 1.
> > > >  In rank 0 and host= bend001  Do Barrier call 2.
> > > >  In rank 0 and host= bend001  Do Barrier call 3.
> > > >  In rank 1 and host= bend002  Do Barrier call 1.
> > > >  In rank 1 and host= bend002  Do Barrier call 2.
> > > >  In rank 1 and host= bend002  Do Barrier call 3.
> > > > [rhc@bend001 v1.8]$
> > > >
> > > > How are you configuring OMPI?
> > > >
> > > > End of users Digest, Vol 2879, Issue 1
> > >
> > > End of users Digest, Vol 2881, Issue 1
> >
> > ------------------------------
> >
> > Message: 2
> > Date: Tue, 6 May 2014 18:50:50 -0500
> > From: Clay Kirkland <clay.kirkl...@versityinc.com>
> > Subject: Re: [OMPI users] users Digest, Vol 2881, Issue 1
> >
> > Well, it turns out I can't seem to get all three of my machines on the
> > same page. Two of them are using eth0 and one is using eth1. CentOS
> > seems unable to bring up multiple network interfaces for some reason,
> > and when I use the mca param to use eth0 it works on two machines but
> > not the other. Is there some way to use only eth1 on one host and only
> > eth0 on the other two? Maybe environment variables, but I can't seem to
> > get that to work either.
> >
> > Clay
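> > A couple of ways to express that (sketches only; the subnet form is what
> > Ralph suggests at the top of this thread, and the per-host config file is
> > an assumption about how the parameter could be split up per machine):
> >
> >     # one command line for all hosts: select by subnet instead of by name,
> >     # so each host picks whichever NIC carries a 192.168.0.x address
> >     mpirun -np 2 --host centos,RAID \
> >         -mca btl_tcp_if_include 192.168.0.0/16 a.out
> >
> >     # or give each machine its own default in $HOME/.openmpi/mca-params.conf
> >     #   on the two eth0 machines:  btl_tcp_if_include = eth0
> >     #   on the eth1 machine:       btl_tcp_if_include = eth1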
You might try:
> > >> >
> > >> > - Upgrading to 1.6.5
> > >> > - Or better yet, upgrading to 1.8.1
> > >> >
> > >>
> > >> I will attempt that then. I read at
> > >>
> > >> http://www.open-mpi.org/faq/?category=building#install-overwrite
> > >>
> > >> that I should completely uninstall my previous version. Could you
> > >> recommend to me how I can go about doing it (without root access).
> > >> I am uncertain where I can use make uninstall.
> > >>
> > >> Imran
> > >>
> > >> > On May 6, 2014, at 7:24 AM, Imran Ali <imra...@student.matnat.uio.no>
> > >> wrote:
> > >> >
> > >> >> I get the following error when I try to run the following python code
> > >> >>
> > >> >> import mpi4py.MPI as MPI
> > >> >> comm = MPI.COMM_WORLD
> > >> >> MPI.File.Open(comm,"some.file")
> > >> >>
> > >> >> $ mpirun -np 1 python test_mpi.py
> > >> >> Traceback (most recent call last):
> > >> >>   File "test_mpi.py", line 3, in <module>
> > >> >>     MPI.File.Open(comm," h5ex_d_alloc.h5")
> > >> >>   File "File.pyx", line 67, in mpi4py.MPI.File.Open (src/mpi4py.MPI.c:89639)
> > >> >> mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
> > >> >> --------------------------------------------------------------------------
> > >> >> mpirun noticed that the job aborted, but has no info as to the process
> > >> >> that caused that situation.
> > >> >> --------------------------------------------------------------------------
> > >> >>
> > >> >> My mpirun version is (Open MPI) 1.6.2. I installed openmpi using the
> > >> dorsal script (https://github.com/FEniCS/dorsal) for Redhat Enterprise 6
> > >> (OS I am using, release 6.5). It configured the build as follows:
> > >> >>
> > >> >> ./configure --enable-mpi-thread-multiple --enable-opal-multi-threads
> > >> --with-threads=posix --disable-mpi-profile
> > >> >>
> > >> >> I need to emphasize that I do not have root access on the system I am
> > >> running my application.
> > >> >>
> > >> >> Imran
> > >> >>
> > >> >> _______________________________________________
> > >> >> users mailing list
> > >> >> us...@open-mpi.org
> > >> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >> >
> > >> > --
> > >> > Jeff Squyres
> > >> > jsquy...@cisco.com
> > >> > For corporate legal information go to:
> > >> http://www.cisco.com/web/about/doing_business/legal/cri/
> > >> >
> > >> > _______________________________________________
> > >> > users mailing list
> > >> > us...@open-mpi.org
> > >> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >>
> > >> ------------------------------
> > >>
> > >> Message: 6
> > >> Date: Tue, 6 May 2014 13:34:38 +0000
> > >> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> > >> To: Open MPI Users <us...@open-mpi.org>
> > >> Subject: Re: [OMPI users] MPI File Open does not work
> > >> Message-ID: <2a933c0e-80f6-4ded-b44c-53b5f37ef...@cisco.com>
> > >> Content-Type: text/plain; charset="us-ascii"
> > >>
> > >> On May 6, 2014, at 9:32 AM, Imran Ali <imra...@student.matnat.uio.no>
> > >> wrote:
> > >>
> > >> > I will attempt that then. I read at
> > >> >
> > >> > http://www.open-mpi.org/faq/?category=building#install-overwrite
> > >> >
> > >> > that I should completely uninstall my previous version.
> > >>
> > >> Yes, that is best. OR: you can install into a whole separate tree and
> > >> ignore the first installation.
> > >>
> > >> > Could you recommend to me how I can go about doing it (without root access).
> > >> > I am uncertain where I can use make uninstall. > > >> > > >> If you don't have write access into the installation tree (i.e., it > was > > >> installed via root and you don't have root access), then your best > bet is > > >> simply to install into a new tree. E.g., if OMPI is installed into > > >> /opt/openmpi-1.6.2, try installing into /opt/openmpi-1.6.5, or even > > >> $HOME/installs/openmpi-1.6.5, or something like that. > > >> > > >> -- > > >> Jeff Squyres > > >> jsquy...@cisco.com > > >> For corporate legal information go to: > > >> http://www.cisco.com/web/about/doing_business/legal/cri/ > > >> > > >> > > >> > > >> ------------------------------ > > >> > > >> Message: 7 > > >> Date: Tue, 6 May 2014 15:40:34 +0200 > > >> From: Imran Ali <imra...@student.matnat.uio.no> > > >> To: Open MPI Users <us...@open-mpi.org> > > >> Subject: Re: [OMPI users] MPI File Open does not work > > >> Message-ID: <14f0596c-c5c5-4b1a-a4a8-8849d44ab...@math.uio.no> > > >> Content-Type: text/plain; charset=us-ascii > > >> > > >> > > >> 6. mai 2014 kl. 15:34 skrev Jeff Squyres (jsquyres) < > jsquy...@cisco.com>: > > >> > > >> > On May 6, 2014, at 9:32 AM, Imran Ali < > imra...@student.matnat.uio.no> > > >> wrote: > > >> > > > >> >> I will attempt that than. I read at > > >> >> > > >> >> http://www.open-mpi.org/faq/?category=building#install-overwrite > > >> >> > > >> >> that I should completely uninstall my previous version. > > >> > > > >> > Yes, that is best. OR: you can install into a whole separate tree > and > > >> ignore the first installation. > > >> > > > >> >> Could you recommend to me how I can go about doing it (without root > > >> access). > > >> >> I am uncertain where I can use make uninstall. > > >> > > > >> > If you don't have write access into the installation tree (i.e., it > was > > >> installed via root and you don't have root access), then your best > bet is > > >> simply to install into a new tree. E.g., if OMPI is installed into > > >> /opt/openmpi-1.6.2, try installing into /opt/openmpi-1.6.5, or even > > >> $HOME/installs/openmpi-1.6.5, or something like that. > > >> > > >> My install was in my user directory (i.e $HOME). I managed to locate > the > > >> source directory and successfully run make uninstall. > > >> > > >> Will let you know how things went after installation. > > >> > > >> Imran > > >> > > >> > > > >> > -- > > >> > Jeff Squyres > > >> > jsquy...@cisco.com > > >> > For corporate legal information go to: > > >> http://www.cisco.com/web/about/doing_business/legal/cri/ > > >> > > > >> > _______________________________________________ > > >> > users mailing list > > >> > us...@open-mpi.org > > >> > http://www.open-mpi.org/mailman/listinfo.cgi/users > > >> > > >> > > >> > > >> ------------------------------ > > >> > > >> Message: 8 > > >> Date: Tue, 6 May 2014 14:42:52 +0000 > > >> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> > > >> To: Open MPI Users <us...@open-mpi.org> > > >> Subject: Re: [OMPI users] MPI File Open does not work > > >> Message-ID: <710e3328-edaa-4a13-9f07-b45fe3191...@cisco.com> > > >> Content-Type: text/plain; charset="us-ascii" > > >> > > >> On May 6, 2014, at 9:40 AM, Imran Ali <imra...@student.matnat.uio.no> > > >> wrote: > > >> > > >> > My install was in my user directory (i.e $HOME). I managed to locate > > >> the source directory and successfully run make uninstall. > > >> > > >> > > >> FWIW, I usually install Open MPI into its own subdir. E.g., > > >> $HOME/installs/openmpi-x.y.z. 
Then if I don't want that install any > more, > > >> I can just "rm -rf $HOME/installs/openmpi-x.y.z" -- no need to "make > > >> uninstall". Specifically: if there's nothing else installed in the > same > > >> tree as Open MPI, you can just rm -rf its installation tree. > > >> > > >> -- > > >> Jeff Squyres > > >> jsquy...@cisco.com > > >> For corporate legal information go to: > > >> http://www.cisco.com/web/about/doing_business/legal/cri/ > > >> > > >> > > >> > > >> ------------------------------ > > >> > > >> Message: 9 > > >> Date: Tue, 6 May 2014 14:50:34 +0000 > > >> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> > > >> To: Open MPI Users <us...@open-mpi.org> > > >> Subject: Re: [OMPI users] users Digest, Vol 2879, Issue 1 > > >> Message-ID: <c60aa7e1-96a7-4ccd-9b5b-11a38fb87...@cisco.com> > > >> Content-Type: text/plain; charset="us-ascii" > > >> > > >> Are you using TCP as the MPI transport? > > >> > > >> If so, another thing to try is to limit the IP interfaces that MPI > uses > > >> for its traffic to see if there's some kind of problem with specific > > >> networks. > > >> > > >> For example: > > >> > > >> mpirun --mca btl_tcp_if_include eth0 ... > > >> > > >> If that works, then try adding in any/all other IP interfaces that you > > >> have on your machines. > > >> > > >> A sorta-wild guess: you have some IP interfaces that aren't working, > or > > >> at least, don't work in the way that OMPI wants them to work. So the > first > > >> barrier works because it flows across eth0 (or some other first > network > > >> that, as far as OMPI is concerned, works just fine). But then the > next > > >> barrier round-robin advances to the next IP interface, and it doesn't > work > > >> for some reason. > > >> > > >> We've seen virtual machine bridge interfaces cause problems, for > example. > > >> E.g., if a machine has a Xen virtual machine interface (vibr0, > IIRC?), > > >> then OMPI will try to use it to communicate with peer MPI processes > because > > >> it has a "compatible" IP address, and OMPI think it should be > > >> connected/reachable to peers. If this is the case, you might want to > > >> disable such interfaces and/or use btl_tcp_if_include or > btl_tcp_if_exclude > > >> to select the interfaces that you want to use. > > >> > > >> Pro tip: if you use btl_tcp_if_exclude, remember to exclude the > loopback > > >> interface, too. OMPI defaults to a btl_tcp_if_include="" (blank) and > > >> btl_tcp_if_exclude="lo0". So if you override btl_tcp_if_exclude, you > need > > >> to be sure to *also* include lo0 in the new value. For example: > > >> > > >> mpirun --mca btl_tcp_if_exclude lo0,virb0 ... > > >> > > >> Also, if possible, try upgrading to Open MPI 1.8.1. > > >> > > >> > > >> > > >> On May 4, 2014, at 2:15 PM, Clay Kirkland < > clay.kirkl...@versityinc.com> > > >> wrote: > > >> > > >> > I am configuring with all defaults. Just doing a ./configure and > then > > >> > make and make install. I have used open mpi on several kinds of > > >> > unix systems this way and have had no trouble before. I believe I > > >> > last had success on a redhat version of linux. 
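For concreteness, the interface-restriction suggestion above would look roughly like the following full command lines. This is only a sketch: it reuses the a.out test and the centos/RAID host pair from earlier in this thread, eth0 and virb0 are just the example interface names quoted above, and on these CentOS hosts the loopback interface is named lo (lo0 is the BSD/OS X spelling), so lo is what goes in the exclude list here.

    # pin MPI's TCP traffic to one known-good interface
    /usr/local/bin/mpirun -np 2 --host centos,RAID --mca btl_tcp_if_include eth0 a.out

    # or exclude the interfaces you do not want, keeping the loopback excluded too
    /usr/local/bin/mpirun -np 2 --host centos,RAID --mca btl_tcp_if_exclude lo,virb0 a.out

If the hang on the second MPI_Barrier disappears once traffic is pinned to a single interface, that points at the unreachable-second-interface behavior described above.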
> > >> > > > >> > > > >> > On Sat, May 3, 2014 at 11:00 AM, <users-requ...@open-mpi.org> > wrote: > > >> > Send users mailing list submissions to > > >> > us...@open-mpi.org > > >> > > > >> > To subscribe or unsubscribe via the World Wide Web, visit > > >> > http://www.open-mpi.org/mailman/listinfo.cgi/users > > >> > or, via email, send a message with subject or body 'help' to > > >> > users-requ...@open-mpi.org > > >> > > > >> > You can reach the person managing the list at > > >> > users-ow...@open-mpi.org > > >> > > > >> > When replying, please edit your Subject line so it is more specific > > >> > than "Re: Contents of users digest..." > > >> > > > >> > > > >> > Today's Topics: > > >> > > > >> > 1. MPI_Barrier hangs on second attempt but only when multiple > > >> > hosts used. (Clay Kirkland) > > >> > 2. Re: MPI_Barrier hangs on second attempt but only when > > >> > multiple hosts used. (Ralph Castain) > > >> > > > >> > > > >> > > ---------------------------------------------------------------------- > > >> > > > >> > Message: 1 > > >> > Date: Fri, 2 May 2014 16:24:04 -0500 > > >> > From: Clay Kirkland <clay.kirkl...@versityinc.com> > > >> > To: us...@open-mpi.org > > >> > Subject: [OMPI users] MPI_Barrier hangs on second attempt but only > > >> > when multiple hosts used. > > >> > Message-ID: > > >> > <CAJDnjA8Wi=FEjz6Vz+Bc34b+nFE= > > >> tf4b7g0bqgmbekg7h-p...@mail.gmail.com> > > >> > Content-Type: text/plain; charset="utf-8" > > >> > > > >> > I have been using MPI for many many years so I have very well > debugged > > >> mpi > > >> > tests. I am > > >> > having trouble on either openmpi-1.4.5 or openmpi-1.6.5 versions > > >> though > > >> > with getting the > > >> > MPI_Barrier calls to work. It works fine when I run all processes > on > > >> one > > >> > machine but when > > >> > I run with two or more hosts the second call to MPI_Barrier always > > >> hangs. > > >> > Not the first one, > > >> > but always the second one. I looked at FAQ's and such but found > > >> nothing > > >> > except for a comment > > >> > that MPI_Barrier problems were often problems with fire walls. Also > > >> > mentioned as a problem > > >> > was not having the same version of mpi on both machines. I turned > > >> > firewalls off and removed > > >> > and reinstalled the same version on both hosts but I still see the > same > > >> > thing. I then installed > > >> > lam mpi on two of my machines and that works fine. I can call the > > >> > MPI_Barrier function when run on > > >> > one of two machines by itself many times with no hangs. Only > hangs if > > >> two > > >> > or more hosts are involved. > > >> > These runs are all being done on CentOS release 6.4. Here is test > > >> program > > >> > I used. 
> > >> > > > >> > main (argc, argv) > > >> > int argc; > > >> > char **argv; > > >> > { > > >> > char message[20]; > > >> > char hoster[256]; > > >> > char nameis[256]; > > >> > int fd, i, j, jnp, iret, myrank, np, ranker, recker; > > >> > MPI_Comm comm; > > >> > MPI_Status status; > > >> > > > >> > MPI_Init( &argc, &argv ); > > >> > MPI_Comm_rank( MPI_COMM_WORLD, &myrank); > > >> > MPI_Comm_size( MPI_COMM_WORLD, &np); > > >> > > > >> > gethostname(hoster,256); > > >> > > > >> > printf(" In rank %d and host= %s Do Barrier call > > >> > 1.\n",myrank,hoster); > > >> > MPI_Barrier(MPI_COMM_WORLD); > > >> > printf(" In rank %d and host= %s Do Barrier call > > >> > 2.\n",myrank,hoster); > > >> > MPI_Barrier(MPI_COMM_WORLD); > > >> > printf(" In rank %d and host= %s Do Barrier call > > >> > 3.\n",myrank,hoster); > > >> > MPI_Barrier(MPI_COMM_WORLD); > > >> > MPI_Finalize(); > > >> > exit(0); > > >> > } > > >> > > > >> > Here are three runs of test program. First with two processes on > one > > >> > host, then with > > >> > two processes on another host, and finally with one process on each > of > > >> two > > >> > hosts. The > > >> > first two runs are fine but the last run hangs on the second > > >> MPI_Barrier. > > >> > > > >> > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos a.out > > >> > In rank 0 and host= centos Do Barrier call 1. > > >> > In rank 1 and host= centos Do Barrier call 1. > > >> > In rank 1 and host= centos Do Barrier call 2. > > >> > In rank 1 and host= centos Do Barrier call 3. > > >> > In rank 0 and host= centos Do Barrier call 2. > > >> > In rank 0 and host= centos Do Barrier call 3. > > >> > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out > > >> > /root/.bashrc: line 14: unalias: ls: not found > > >> > In rank 0 and host= RAID Do Barrier call 1. > > >> > In rank 0 and host= RAID Do Barrier call 2. > > >> > In rank 0 and host= RAID Do Barrier call 3. > > >> > In rank 1 and host= RAID Do Barrier call 1. > > >> > In rank 1 and host= RAID Do Barrier call 2. > > >> > In rank 1 and host= RAID Do Barrier call 3. > > >> > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos,RAID > a.out > > >> > /root/.bashrc: line 14: unalias: ls: not found > > >> > In rank 0 and host= centos Do Barrier call 1. > > >> > In rank 0 and host= centos Do Barrier call 2. > > >> > In rank 1 and host= RAID Do Barrier call 1. > > >> > In rank 1 and host= RAID Do Barrier call 2. > > >> > > > >> > Since it is such a simple test and problem and such a widely used > MPI > > >> > function, it must obviously > > >> > be an installation or configuration problem. 
A pstack for each of > the > > >> > hung MPI_Barrier processes > > >> > on the two machines shows this: > > >> > > > >> > [root@centos ~]# pstack 31666 > > >> > #0 0x0000003baf0e8ee3 in __epoll_wait_nocancel () from > /lib64/libc.so.6 > > >> > #1 0x00007f5de06125eb in epoll_dispatch () from > > >> /usr/local/lib/libmpi.so.1 > > >> > #2 0x00007f5de061475a in opal_event_base_loop () from > > >> > /usr/local/lib/libmpi.so.1 > > >> > #3 0x00007f5de0639229 in opal_progress () from > > >> /usr/local/lib/libmpi.so.1 > > >> > #4 0x00007f5de0586f75 in ompi_request_default_wait_all () from > > >> > /usr/local/lib/libmpi.so.1 > > >> > #5 0x00007f5ddc59565e in ompi_coll_tuned_sendrecv_actual () from > > >> > /usr/local/lib/openmpi/mca_coll_tuned.so > > >> > #6 0x00007f5ddc59d8ff in ompi_coll_tuned_barrier_intra_two_procs () > > >> from > > >> > /usr/local/lib/openmpi/mca_coll_tuned.so > > >> > #7 0x00007f5de05941c2 in PMPI_Barrier () from > > >> /usr/local/lib/libmpi.so.1 > > >> > #8 0x0000000000400a43 in main () > > >> > > > >> > [root@RAID openmpi-1.6.5]# pstack 22167 > > >> > #0 0x00000030302e8ee3 in __epoll_wait_nocancel () from > /lib64/libc.so.6 > > >> > #1 0x00007f7ee46885eb in epoll_dispatch () from > > >> /usr/local/lib/libmpi.so.1 > > >> > #2 0x00007f7ee468a75a in opal_event_base_loop () from > > >> > /usr/local/lib/libmpi.so.1 > > >> > #3 0x00007f7ee46af229 in opal_progress () from > > >> /usr/local/lib/libmpi.so.1 > > >> > #4 0x00007f7ee45fcf75 in ompi_request_default_wait_all () from > > >> > /usr/local/lib/libmpi.so.1 > > >> > #5 0x00007f7ee060b65e in ompi_coll_tuned_sendrecv_actual () from > > >> > /usr/local/lib/openmpi/mca_coll_tuned.so > > >> > #6 0x00007f7ee06138ff in ompi_coll_tuned_barrier_intra_two_procs () > > >> from > > >> > /usr/local/lib/openmpi/mca_coll_tuned.so > > >> > #7 0x00007f7ee460a1c2 in PMPI_Barrier () from > > >> /usr/local/lib/libmpi.so.1 > > >> > #8 0x0000000000400a43 in main () > > >> > > > >> > Which looks exactly the same on each machine. Any thoughts or > ideas > > >> would > > >> > be greatly appreciated as > > >> > I am stuck. > > >> > > > >> > Clay Kirkland > > >> > -------------- next part -------------- > > >> > HTML attachment scrubbed and removed > > >> > > > >> > ------------------------------ > > >> > > > >> > Message: 2 > > >> > Date: Sat, 3 May 2014 06:39:20 -0700 > > >> > From: Ralph Castain <r...@open-mpi.org> > > >> > To: Open MPI Users <us...@open-mpi.org> > > >> > Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but > only > > >> > when multiple hosts used. > > >> > Message-ID: <3cf53d73-15d9-40bb-a2de-50ba3561a...@open-mpi.org> > > >> > Content-Type: text/plain; charset="us-ascii" > > >> > > > >> > Hmmm...just testing on my little cluster here on two nodes, it works > > >> just fine with 1.8.2: > > >> > > > >> > [rhc@bend001 v1.8]$ mpirun -n 2 --map-by node ./a.out > > >> > In rank 0 and host= bend001 Do Barrier call 1. > > >> > In rank 0 and host= bend001 Do Barrier call 2. > > >> > In rank 0 and host= bend001 Do Barrier call 3. > > >> > In rank 1 and host= bend002 Do Barrier call 1. > > >> > In rank 1 and host= bend002 Do Barrier call 2. > > >> > In rank 1 and host= bend002 Do Barrier call 3. > > >> > [rhc@bend001 v1.8]$ > > >> > > > >> > > > >> > How are you configuring OMPI? > > >> > > > >> > > > >> > On May 2, 2014, at 2:24 PM, Clay Kirkland < > clay.kirkl...@versityinc.com> > > >> wrote: > > >> > > > >> > > I have been using MPI for many many years so I have very well > > >> debugged mpi tests. 
I am > > >> > > having trouble on either openmpi-1.4.5 or openmpi-1.6.5 versions > > >> though with getting the > > >> > > MPI_Barrier calls to work. It works fine when I run all > processes > > >> on one machine but when > > >> > > I run with two or more hosts the second call to MPI_Barrier always > > >> hangs. Not the first one, > > >> > > but always the second one. I looked at FAQ's and such but found > > >> nothing except for a comment > > >> > > that MPI_Barrier problems were often problems with fire walls. > Also > > >> mentioned as a problem > > >> > > was not having the same version of mpi on both machines. I turned > > >> firewalls off and removed > > >> > > and reinstalled the same version on both hosts but I still see the > > >> same thing. I then installed > > >> > > lam mpi on two of my machines and that works fine. I can call > the > > >> MPI_Barrier function when run on > > >> > > one of two machines by itself many times with no hangs. Only > hangs > > >> if two or more hosts are involved. > > >> > > These runs are all being done on CentOS release 6.4. Here is > test > > >> program I used. > > >> > > > > >> > > main (argc, argv) > > >> > > int argc; > > >> > > char **argv; > > >> > > { > > >> > > char message[20]; > > >> > > char hoster[256]; > > >> > > char nameis[256]; > > >> > > int fd, i, j, jnp, iret, myrank, np, ranker, recker; > > >> > > MPI_Comm comm; > > >> > > MPI_Status status; > > >> > > > > >> > > MPI_Init( &argc, &argv ); > > >> > > MPI_Comm_rank( MPI_COMM_WORLD, &myrank); > > >> > > MPI_Comm_size( MPI_COMM_WORLD, &np); > > >> > > > > >> > > gethostname(hoster,256); > > >> > > > > >> > > printf(" In rank %d and host= %s Do Barrier call > > >> 1.\n",myrank,hoster); > > >> > > MPI_Barrier(MPI_COMM_WORLD); > > >> > > printf(" In rank %d and host= %s Do Barrier call > > >> 2.\n",myrank,hoster); > > >> > > MPI_Barrier(MPI_COMM_WORLD); > > >> > > printf(" In rank %d and host= %s Do Barrier call > > >> 3.\n",myrank,hoster); > > >> > > MPI_Barrier(MPI_COMM_WORLD); > > >> > > MPI_Finalize(); > > >> > > exit(0); > > >> > > } > > >> > > > > >> > > Here are three runs of test program. First with two processes > on > > >> one host, then with > > >> > > two processes on another host, and finally with one process on > each > > >> of two hosts. The > > >> > > first two runs are fine but the last run hangs on the second > > >> MPI_Barrier. > > >> > > > > >> > > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos > a.out > > >> > > In rank 0 and host= centos Do Barrier call 1. > > >> > > In rank 1 and host= centos Do Barrier call 1. > > >> > > In rank 1 and host= centos Do Barrier call 2. > > >> > > In rank 1 and host= centos Do Barrier call 3. > > >> > > In rank 0 and host= centos Do Barrier call 2. > > >> > > In rank 0 and host= centos Do Barrier call 3. > > >> > > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out > > >> > > /root/.bashrc: line 14: unalias: ls: not found > > >> > > In rank 0 and host= RAID Do Barrier call 1. > > >> > > In rank 0 and host= RAID Do Barrier call 2. > > >> > > In rank 0 and host= RAID Do Barrier call 3. > > >> > > In rank 1 and host= RAID Do Barrier call 1. > > >> > > In rank 1 and host= RAID Do Barrier call 2. > > >> > > In rank 1 and host= RAID Do Barrier call 3. > > >> > > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos,RAID > > >> a.out > > >> > > /root/.bashrc: line 14: unalias: ls: not found > > >> > > In rank 0 and host= centos Do Barrier call 1. > > >> > > In rank 0 and host= centos Do Barrier call 2. 
> > >> > > In rank 1 and host= RAID Do Barrier call 1. > > >> > > In rank 1 and host= RAID Do Barrier call 2. > > >> > > > > >> > > Since it is such a simple test and problem and such a widely > used > > >> MPI function, it must obviously > > >> > > be an installation or configuration problem. A pstack for each > of > > >> the hung MPI_Barrier processes > > >> > > on the two machines shows this: > > >> > > > > >> > > [root@centos ~]# pstack 31666 > > >> > > #0 0x0000003baf0e8ee3 in __epoll_wait_nocancel () from > > >> /lib64/libc.so.6 > > >> > > #1 0x00007f5de06125eb in epoll_dispatch () from > > >> /usr/local/lib/libmpi.so.1 > > >> > > #2 0x00007f5de061475a in opal_event_base_loop () from > > >> /usr/local/lib/libmpi.so.1 > > >> > > #3 0x00007f5de0639229 in opal_progress () from > > >> /usr/local/lib/libmpi.so.1 > > >> > > #4 0x00007f5de0586f75 in ompi_request_default_wait_all () from > > >> /usr/local/lib/libmpi.so.1 > > >> > > #5 0x00007f5ddc59565e in ompi_coll_tuned_sendrecv_actual () from > > >> /usr/local/lib/openmpi/mca_coll_tuned.so > > >> > > #6 0x00007f5ddc59d8ff in ompi_coll_tuned_barrier_intra_two_procs > () > > >> from /usr/local/lib/openmpi/mca_coll_tuned.so > > >> > > #7 0x00007f5de05941c2 in PMPI_Barrier () from > > >> /usr/local/lib/libmpi.so.1 > > >> > > #8 0x0000000000400a43 in main () > > >> > > > > >> > > [root@RAID openmpi-1.6.5]# pstack 22167 > > >> > > #0 0x00000030302e8ee3 in __epoll_wait_nocancel () from > > >> /lib64/libc.so.6 > > >> > > #1 0x00007f7ee46885eb in epoll_dispatch () from > > >> /usr/local/lib/libmpi.so.1 > > >> > > #2 0x00007f7ee468a75a in opal_event_base_loop () from > > >> /usr/local/lib/libmpi.so.1 > > >> > > #3 0x00007f7ee46af229 in opal_progress () from > > >> /usr/local/lib/libmpi.so.1 > > >> > > #4 0x00007f7ee45fcf75 in ompi_request_default_wait_all () from > > >> /usr/local/lib/libmpi.so.1 > > >> > > #5 0x00007f7ee060b65e in ompi_coll_tuned_sendrecv_actual () from > > >> /usr/local/lib/openmpi/mca_coll_tuned.so > > >> > > #6 0x00007f7ee06138ff in ompi_coll_tuned_barrier_intra_two_procs > () > > >> from /usr/local/lib/openmpi/mca_coll_tuned.so > > >> > > #7 0x00007f7ee460a1c2 in PMPI_Barrier () from > > >> /usr/local/lib/libmpi.so.1 > > >> > > #8 0x0000000000400a43 in main () > > >> > > > > >> > > Which looks exactly the same on each machine. Any thoughts or > ideas > > >> would be greatly appreciated as > > >> > > I am stuck. 
> > >> > > Clay Kirkland
> > >> >
> > >> > ------------------------------
> > >> >
> > >> > End of users Digest, Vol 2879, Issue 1
> > >> > **************************************
> > >>
> > >> ------------------------------
> > >>
> > >> End of users Digest, Vol 2881, Issue 1
> > >> **************************************
> >
> > ------------------------------
> >
> > End of users Digest, Vol 2881, Issue 2
> > **************************************
>
> ------------------------------
>
> End of users Digest, Vol 2881, Issue 4
> **************************************
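For reference, here is a self-contained ANSI C version of the barrier reproducer quoted repeatedly above. The MPI calls and output format are unchanged; the #include lines and the modern main() signature are additions (the original posting omitted the headers), and the unused scratch variables are dropped, so treat it as a sketch of the same test rather than the exact file that was run.

    #include <stdio.h>
    #include <unistd.h>   /* gethostname() */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        char hoster[256];
        int  myrank, np;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);   /* queried but otherwise unused, as in the original */

        gethostname(hoster, sizeof(hoster));

        /* Three back-to-back barriers: the reported hang is on the second one
           when the two ranks run on different hosts. */
        printf(" In rank %d and host= %s Do Barrier call 1.\n", myrank, hoster);
        MPI_Barrier(MPI_COMM_WORLD);
        printf(" In rank %d and host= %s Do Barrier call 2.\n", myrank, hoster);
        MPI_Barrier(MPI_COMM_WORLD);
        printf(" In rank %d and host= %s Do Barrier call 3.\n", myrank, hoster);
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

Built with the usual wrapper compiler, e.g. mpicc barrier_test.c -o a.out, it can be launched exactly as in the runs shown above.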