On 05/06/2014 09:49 PM, Ralph Castain wrote:

On May 6, 2014, at 6:24 PM, Clay Kirkland <clay.kirkl...@versityinc.com> wrote:

 Got it to work finally.  The longer line doesn't work, but if I take off
the -mca oob_tcp_if_include 192.168.0.0/16 part then everything works from
every combination of machines I have.

Interesting - I'm surprised, but glad it worked.


Could it be perhaps 192.168.0.0/24 (instead of /16)?
The ifconfig output says the netmask is 255.255.255.0.
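For example, with Mask:255.255.255.0 the matching CIDR selector would be
/24; a sketch, not tested here (host names and the a.out binary are
borrowed from the runs quoted below):

   mpirun -mca btl_tcp_if_include 192.168.0.0/24 \
          -mca oob_tcp_if_include 192.168.0.0/24 \
          -np 2 --host centos,RAID a.out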


And as to any MPI having trouble: in my original posting I stated that
I installed lam mpi on the same hardware and it worked just fine.   Maybe
you guys should look at what they do and copy it.   Virtually every machine
I have used in the last 5 years has multiple NIC interfaces, and almost all
of them are set up to use only one interface.   It seems odd to have a
product that is designed to lash together multiple machines fail with a
default install on generic machines.

Actually, we are the "lam mpi" guys :-)

There clearly is a bug in the connection logic, but a little hint will work
around it until we can resolve it.


  But software is like that sometimes, and I want to thank you very much
for all the help.   Please take my criticism with a grain of salt.   I love
MPI, I just want to see it work.   I have been using it for some 20 years
to synchronize multiple machines for I/O testing, and it is one slick
product for that.   It has helped us find many bugs in shared file
systems.  Thanks again,

No problem!





On Tue, May 6, 2014 at 7:45 PM, <users-requ...@open-mpi.org> wrote:

    Send users mailing list submissions to
    us...@open-mpi.org

    To subscribe or unsubscribe via the World Wide Web, visit
    http://www.open-mpi.org/mailman/listinfo.cgi/users
    or, via email, send a message with subject or body 'help' to
    users-requ...@open-mpi.org

    You can reach the person managing the list at
    users-ow...@open-mpi.org

    When replying, please edit your Subject line so it is more specific
    than "Re: Contents of users digest..."


    Today's Topics:

       1. Re: users Digest, Vol 2881, Issue 2 (Ralph Castain)


    ----------------------------------------------------------------------

    Message: 1
    Date: Tue, 6 May 2014 17:45:09 -0700
    From: Ralph Castain <r...@open-mpi.org>
    To: Open MPI Users <us...@open-mpi.org>
    Subject: Re: [OMPI users] users Digest, Vol 2881, Issue 2
    Message-ID: <4b207e61-952a-4744-9a7b-0704c4b0d...@open-mpi.org>
    Content-Type: text/plain; charset="us-ascii"

    -mca btl_tcp_if_include 192.168.0.0/16
    -mca oob_tcp_if_include 192.168.0.0/16

    should do the trick. Any MPI is going to have trouble with your
    arrangement - just need a little hint to help figure it out.
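
    For example, a minimal sketch of the full command line (the host names
    and a.out binary are the ones from your test runs; adjust np and hosts
    as needed):

       mpirun -mca btl_tcp_if_include 192.168.0.0/16 \
              -mca oob_tcp_if_include 192.168.0.0/16 \
              -np 2 --host centos,RAID a.out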


    On May 6, 2014, at 5:14 PM, Clay Kirkland
    <clay.kirkl...@versityinc.com> wrote:

    >  Someone suggested using some network address if all machines are on
    > the same subnet.   They are all on the same subnet, I think.   I have
    > no idea what to put for a param there.   I tried the ethernet address,
    > but of course it couldn't be that simple.  Here are my ifconfig
    > outputs from a couple of machines:
    >
    > [root@RAID MPI]# ifconfig -a
    > eth0      Link encap:Ethernet  HWaddr 00:25:90:73:2A:36
    >           inet addr:192.168.0.59  Bcast:192.168.0.255  Mask:255.255.255.0
    >           inet6 addr: fe80::225:90ff:fe73:2a36/64 Scope:Link
    >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
    >           RX packets:17983 errors:0 dropped:0 overruns:0 frame:0
    >           TX packets:9952 errors:0 dropped:0 overruns:0 carrier:0
    >           collisions:0 txqueuelen:1000
    >           RX bytes:26309771 (25.0 MiB)  TX bytes:758940 (741.1 KiB)
    >           Interrupt:16 Memory:fbde0000-fbe00000
    >
    > eth1      Link encap:Ethernet  HWaddr 00:25:90:73:2A:37
    >           inet6 addr: fe80::225:90ff:fe73:2a37/64 Scope:Link
    >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
    >           RX packets:56 errors:0 dropped:0 overruns:0 frame:0
    >           TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
    >           collisions:0 txqueuelen:1000
    >           RX bytes:3924 (3.8 KiB)  TX bytes:468 (468.0 b)
    >           Interrupt:17 Memory:fbee0000-fbf00000
    >
    >  And from one that I can't get to work:
    >
    > [root@centos ~]# ifconfig -a
    > eth0      Link encap:Ethernet  HWaddr 00:1E:4F:FB:30:34
    >           inet6 addr: fe80::21e:4fff:fefb:3034/64 Scope:Link
    >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
    >           RX packets:45 errors:0 dropped:0 overruns:0 frame:0
    >           TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
    >           collisions:0 txqueuelen:1000
    >           RX bytes:2700 (2.6 KiB)  TX bytes:468 (468.0 b)
    >           Interrupt:21 Memory:fe9e0000-fea00000
    >
    > eth1      Link encap:Ethernet  HWaddr 00:14:D1:22:8E:50
    >           inet addr:192.168.0.154  Bcast:192.168.0.255  Mask:255.255.255.0
    >           inet6 addr: fe80::214:d1ff:fe22:8e50/64 Scope:Link
    >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
    >           RX packets:160 errors:0 dropped:0 overruns:0 frame:0
    >           TX packets:120 errors:0 dropped:0 overruns:0 carrier:0
    >           collisions:0 txqueuelen:1000
    >           RX bytes:31053 (30.3 KiB)  TX bytes:18897 (18.4 KiB)
    >           Interrupt:16 Base address:0x2f00
    >
    >
    >  The centos machine is using eth1 and not eth0; therein lies the
    > problem.
    >
    >  I don't really need all this optimization of using multiple ethernet
    > adaptors to speed things up.   I am just using MPI to synchronize I/O
    > tests.   Can I go back to a really old version and avoid all this
    > painful debugging?
    >
    >
    >
    >
    > On Tue, May 6, 2014 at 6:50 PM, <users-requ...@open-mpi.org> wrote:
    >
    >
    > Today's Topics:
    >
    >    1. Re: users Digest, Vol 2881, Issue 1 (Clay Kirkland)
    >    2. Re: users Digest, Vol 2881, Issue 1 (Clay Kirkland)
    >
    >
    >
    ----------------------------------------------------------------------
    >
    > Message: 1
    > Date: Tue, 6 May 2014 18:28:59 -0500
    > From: Clay Kirkland <clay.kirkl...@versityinc.com>
    > To: us...@open-mpi.org
    > Subject: Re: [OMPI users] users Digest, Vol 2881, Issue 1
    > Message-ID:
    >   <cajdnja90buhwu_ihssnna1a4p35+o96rrxk19xnhwo-nsd_...@mail.gmail.com>
    > Content-Type: text/plain; charset="utf-8"
    >
    >  That last trick seems to work.  I can get it to work once in a while
    > with those tcp options, but it is tricky, as I have three machines and
    > two of them use eth0 as the primary network interface and one uses
    > eth1.   But by fiddling with network options and perhaps moving a
    > cable or two I think I can get it all to work.    Thanks much for the
    > tip.
    >
    >  Clay
    >
    >
    > On Tue, May 6, 2014 at 11:00 AM, <users-requ...@open-mpi.org> wrote:
    >
    > >
    > >
    > > Today's Topics:
    > >
    > >    1. Re: MPI_Barrier hangs on second attempt but only  when
    > >       multiple hosts used. (Daniels, Marcus G)
    > >    2. ROMIO bug reading darrays (Richard Shaw)
    > >    3. MPI File Open does not work (Imran Ali)
    > >    4. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
    > >    5. Re: MPI File Open does not work (Imran Ali)
    > >    6. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
    > >    7. Re: MPI File Open does not work (Imran Ali)
    > >    8. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
    > >    9. Re: users Digest, Vol 2879, Issue 1 (Jeff Squyres (jsquyres))
    > >
    > >
    > >
    ----------------------------------------------------------------------
    > >
    > > Message: 1
    > > Date: Mon, 5 May 2014 19:28:07 +0000
    > > From: "Daniels, Marcus G" <mdani...@lanl.gov>
    > > To: "'us...@open-mpi.org'" <us...@open-mpi.org>
    > > Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but
    > >         only when multiple hosts used.
    > > Message-ID:
    > >         <532c594b7920a549a2a91cb4312cc57640dc5...@ecs-exg-p-mb01.win.lanl.gov>
    > > Content-Type: text/plain; charset="utf-8"
    > >
    > >
    > >
    > > From: Clay Kirkland [mailto:clay.kirkl...@versityinc.com]
    > > Sent: Friday, May 02, 2014 03:24 PM
    > > To: us...@open-mpi.org
    > > Subject: [OMPI users] MPI_Barrier hangs on second attempt but only
    > > when multiple hosts used.
    > >
    > > I have been using MPI for many many years, so I have very well
    > > debugged mpi tests.   I am having trouble on either openmpi-1.4.5 or
    > > openmpi-1.6.5 versions, though, with getting the MPI_Barrier calls
    > > to work.   It works fine when I run all processes on one machine,
    > > but when I run with two or more hosts the second call to MPI_Barrier
    > > always hangs.   Not the first one, but always the second one.   I
    > > looked at FAQ's and such but found nothing except for a comment that
    > > MPI_Barrier problems were often problems with firewalls.  Also
    > > mentioned as a problem was not having the same version of mpi on
    > > both machines.  I turned firewalls off and removed and reinstalled
    > > the same version on both hosts, but I still see the same thing.   I
    > > then installed lam mpi on two of my machines and that works fine.
    > > I can call the MPI_Barrier function when run on one of two machines
    > > by itself many times with no hangs.  Only hangs if two or more hosts
    > > are involved.   These runs are all being done on CentOS release 6.4.
    > > Here is the test program I used:
    > >
    > > #include <stdio.h>
    > > #include <unistd.h>
    > > #include <mpi.h>
    > >
    > > int main (int argc, char **argv)
    > > {
    > >     char hoster[256];
    > >     int myrank, np;
    > >
    > >     MPI_Init( &argc, &argv );
    > >     MPI_Comm_rank( MPI_COMM_WORLD, &myrank);
    > >     MPI_Comm_size( MPI_COMM_WORLD, &np);
    > >
    > >     gethostname(hoster,256);
    > >
    > >     printf(" In rank %d and host= %s  Do Barrier call 1.\n",myrank,hoster);
    > >     MPI_Barrier(MPI_COMM_WORLD);
    > >     printf(" In rank %d and host= %s  Do Barrier call 2.\n",myrank,hoster);
    > >     MPI_Barrier(MPI_COMM_WORLD);
    > >     printf(" In rank %d and host= %s  Do Barrier call 3.\n",myrank,hoster);
    > >     MPI_Barrier(MPI_COMM_WORLD);
    > >     MPI_Finalize();
    > >     return 0;
    > > }
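    > >
    > >   (Built and launched with something like "mpicc test.c -o a.out"
    > > followed by the mpirun commands below; the source file name here is
    > > an assumption.)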
    > >
    > >   Here are three runs of the test program.  First with two
    > > processes on one host, then with two processes on another host, and
    > > finally with one process on each of two hosts.  The first two runs
    > > are fine, but the last run hangs on the second MPI_Barrier.
    > >
    > > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos a.out
    > >  In rank 0 and host= centos  Do Barrier call 1.
    > >  In rank 1 and host= centos  Do Barrier call 1.
    > >  In rank 1 and host= centos  Do Barrier call 2.
    > >  In rank 1 and host= centos  Do Barrier call 3.
    > >  In rank 0 and host= centos  Do Barrier call 2.
    > >  In rank 0 and host= centos  Do Barrier call 3.
    > > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out
    > > /root/.bashrc: line 14: unalias: ls: not found
    > >  In rank 0 and host= RAID  Do Barrier call 1.
    > >  In rank 0 and host= RAID  Do Barrier call 2.
    > >  In rank 0 and host= RAID  Do Barrier call 3.
    > >  In rank 1 and host= RAID  Do Barrier call 1.
    > >  In rank 1 and host= RAID  Do Barrier call 2.
    > >  In rank 1 and host= RAID  Do Barrier call 3.
    > > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos,RAID a.out
    > > /root/.bashrc: line 14: unalias: ls: not found
    > >  In rank 0 and host= centos  Do Barrier call 1.
    > >  In rank 0 and host= centos  Do Barrier call 2.
    > > In rank 1 and host= RAID  Do Barrier call 1.
    > >  In rank 1 and host= RAID  Do Barrier call 2.
    > >
    > >   Since it is such a simple test and problem, and such a widely
    > > used MPI function, it must obviously be an installation or
    > > configuration problem.   A pstack for each of the hung MPI_Barrier
    > > processes on the two machines shows this:
    > >
    > > [root@centos ~]# pstack 31666
    > > #0  0x0000003baf0e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
    > > #1  0x00007f5de06125eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
    > > #2  0x00007f5de061475a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
    > > #3  0x00007f5de0639229 in opal_progress () from /usr/local/lib/libmpi.so.1
    > > #4  0x00007f5de0586f75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
    > > #5  0x00007f5ddc59565e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
    > > #6  0x00007f5ddc59d8ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
    > > #7  0x00007f5de05941c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
    > > #8  0x0000000000400a43 in main ()
    > >
    > > [root@RAID openmpi-1.6.5]# pstack 22167
    > > #0  0x00000030302e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
    > > #1  0x00007f7ee46885eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
    > > #2  0x00007f7ee468a75a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
    > > #3  0x00007f7ee46af229 in opal_progress () from /usr/local/lib/libmpi.so.1
    > > #4  0x00007f7ee45fcf75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
    > > #5  0x00007f7ee060b65e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
    > > #6  0x00007f7ee06138ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
    > > #7  0x00007f7ee460a1c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
    > > #8  0x0000000000400a43 in main ()
    > >
    > >  Which looks exactly the same on each machine.  Any thoughts or
    > > ideas would be greatly appreciated, as I am stuck.
    > >
    > >  Clay Kirkland
    > >
    > >
    > >
    > >
    > >
    > >
    > >
    > >
    > >
    > > ------------------------------
    > >
    > > Message: 2
    > > Date: Mon, 5 May 2014 22:20:59 -0400
    > > From: Richard Shaw <jr...@cita.utoronto.ca>
    > > To: Open MPI Users <us...@open-mpi.org>
    > > Subject: [OMPI users] ROMIO bug reading darrays
    > > Message-ID:
    > >         <can+evmkc+9kacnpausscziufwdj3jfcsymb-8zdx1etdkab...@mail.gmail.com>
    > > Content-Type: text/plain; charset="utf-8"
    > >
    > > Hello,
    > >
    > > I think I've come across a bug when using ROMIO to read in a 2D
    > > distributed array. I've attached a test case to this email.
    > >
    > > In the testcase I first initialise an array of 25 doubles (which
    > > will be a 5x5 grid), then I create a datatype representing a 5x5
    > > matrix distributed in 3x3 blocks over a 2x2 process grid. As a
    > > reference I use MPI_Pack to pull out the block cyclic array elements
    > > local to each process (which I think is correct). Then I write the
    > > original array of 25 doubles to disk, and use MPI-IO to read it back
    > > in (performing the Open, Set_view, and Read_all), and compare to the
    > > reference.
    > >
    > > Running this with OMPI, the two match on all ranks.
    > >
    > > > mpirun -mca io ompio -np 4 ./darr_read.x
    > > === Rank 0 === (9 elements)
    > > Packed:  0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
    > > Read:    0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
    > >
    > > === Rank 1 === (6 elements)
    > > Packed: 15.0 16.0 17.0 20.0 21.0 22.0
    > > Read:   15.0 16.0 17.0 20.0 21.0 22.0
    > >
    > > === Rank 2 === (6 elements)
    > > Packed:  3.0  4.0  8.0  9.0 13.0 14.0
    > > Read:    3.0  4.0  8.0  9.0 13.0 14.0
    > >
    > > === Rank 3 === (4 elements)
    > > Packed: 18.0 19.0 23.0 24.0
    > > Read:   18.0 19.0 23.0 24.0
    > >
    > >
    > >
    > > However, using ROMIO the two differ on two of the ranks:
    > >
    > > > mpirun -mca io romio -np 4 ./darr_read.x
    > > === Rank 0 === (9 elements)
    > > Packed:  0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
    > > Read:    0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
    > >
    > > === Rank 1 === (6 elements)
    > > Packed: 15.0 16.0 17.0 20.0 21.0 22.0
    > > Read:    0.0  1.0  2.0  0.0  1.0  2.0
    > >
    > > === Rank 2 === (6 elements)
    > > Packed:  3.0  4.0  8.0  9.0 13.0 14.0
    > > Read:    3.0  4.0  8.0  9.0 13.0 14.0
    > >
    > > === Rank 3 === (4 elements)
    > > Packed: 18.0 19.0 23.0 24.0
    > > Read:    0.0  1.0  0.0  1.0
    > >
    > >
    > >
    > > My interpretation is that the behaviour with OMPIO is correct.
    > > Interestingly, everything matches up using both ROMIO and OMPIO if I
    > > set the block shape to 2x2.
    > >
    > > This was run on OS X using 1.8.2a1r31632. I have also run this on
    > > Linux with OpenMPI 1.7.4, and OMPIO is still correct, but using
    > > ROMIO I just get segfaults.
    > >
    > > Thanks,
    > > Richard
    > > [A non-text attachment was scrubbed: darr_read.c (text/x-csrc,
    > > 2218 bytes),
    > > http://www.open-mpi.org/MailArchives/users/attachments/20140505/5a5ab0ba/attachment.bin ]
    > >
    > > ------------------------------
    > >
    > > Message: 3
    > > Date: Tue, 06 May 2014 13:24:35 +0200
    > > From: Imran Ali <imra...@student.matnat.uio.no>
    > > To: <us...@open-mpi.org>
    > > Subject: [OMPI users] MPI File Open does not work
    > > Message-ID: <d57bdf499c00360b737205b085c50...@ulrik.uio.no>
    > > Content-Type: text/plain; charset="utf-8"
    > >
    > >
    > >
    > > I get the following error when I try to run the following python
    > > code:
    > >
    > > import mpi4py.MPI as MPI
    > > comm = MPI.COMM_WORLD
    > > MPI.File.Open(comm,"some.file")
    > >
    > > $ mpirun -np 1 python test_mpi.py
    > > Traceback (most recent call last):
    > >   File "test_mpi.py", line 3, in <module>
    > >     MPI.File.Open(comm," h5ex_d_alloc.h5")
    > >   File "File.pyx", line 67, in mpi4py.MPI.File.Open
    > > (src/mpi4py.MPI.c:89639)
    > > mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
    > > --------------------------------------------------------------------------
    > > mpirun noticed that the job aborted, but has no info as to the
    > > process that caused that situation.
    > > --------------------------------------------------------------------------
    > >
    > > My mpirun version is (Open MPI) 1.6.2. I installed openmpi using the
    > > dorsal script (https://github.com/FEniCS/dorsal) for Redhat
    > > Enterprise 6 (the OS I am using, release 6.5). It configured the
    > > build as follows:
    > >
    > > ./configure --enable-mpi-thread-multiple --enable-opal-multi-threads
    > > --with-threads=posix --disable-mpi-profile
    > >
    > > I need to emphasize that I do not have root access on the system I
    > > am running my application on.
    > >
    > > Imran
    > >
    > >
    > >
    > >
    > > ------------------------------
    > >
    > > Message: 4
    > > Date: Tue, 6 May 2014 12:56:04 +0000
    > > From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
    > > To: Open MPI Users <us...@open-mpi.org>
    > > Subject: Re: [OMPI users] MPI File Open does not work
    > > Message-ID: <e7df28cb-d4fb-4087-928e-18e61d1d2...@cisco.com>
    > > Content-Type: text/plain; charset="us-ascii"
    > >
    > > The thread support in the 1.6 series is not very good.  You might
    > > try:
    > >
    > > - Upgrading to 1.6.5
    > > - Or better yet, upgrading to 1.8.1
    > >
    > >
    > > On May 6, 2014, at 7:24 AM, Imran Ali
    > > <imra...@student.matnat.uio.no> wrote:
    > >
    > > > [Imran's original message quoted in full; snipped]
    > >
    > >
    > > --
    > > Jeff Squyres
    > > jsquy...@cisco.com
    > > For corporate legal information go to:
    > > http://www.cisco.com/web/about/doing_business/legal/cri/
    > >
    > >
    > >
    > > ------------------------------
    > >
    > > Message: 5
    > > Date: Tue, 6 May 2014 15:32:21 +0200
    > > From: Imran Ali <imra...@student.matnat.uio.no>
    > > To: Open MPI Users <us...@open-mpi.org>
    > > Subject: Re: [OMPI users] MPI File Open does not work
    > > Message-ID: <fa6dffff-6c66-4a47-84fc-148fb51ce...@math.uio.no>
    > > Content-Type: text/plain; charset=us-ascii
    > >
    > >
    > > On 6 May 2014, at 14:56, Jeff Squyres (jsquyres)
    > > <jsquy...@cisco.com> wrote:
    > >
    > > > The thread support in the 1.6 series is not very good.  You
    might try:
    > > >
    > > > - Upgrading to 1.6.5
    > > > - Or better yet, upgrading to 1.8.1
    > > >
    > >
    > > I will attempt that then. I read at
    > >
    > > http://www.open-mpi.org/faq/?category=building#install-overwrite
    > >
    > > that I should completely uninstall my previous version. Could you
    > > recommend how I can go about doing it (without root access)?
    > > I am uncertain whether I can use make uninstall.
    > >
    > > Imran
    > >
    > > >
    > > > On May 6, 2014, at 7:24 AM, Imran Ali
    > > > <imra...@student.matnat.uio.no> wrote:
    > > >
    > > >> [Imran's original message quoted in full; snipped]
    > > >
    > > >
    > > > --
    > > > Jeff Squyres
    > > > jsquy...@cisco.com
    > > > For corporate legal information go to:
    > > http://www.cisco.com/web/about/doing_business/legal/cri/
    > > >
    > > > _______________________________________________
    > > > users mailing list
    > > > us...@open-mpi.org
    > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
    > >
    > >
    > >
    > > ------------------------------
    > >
    > > Message: 6
    > > Date: Tue, 6 May 2014 13:34:38 +0000
    > > From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
    > > To: Open MPI Users <us...@open-mpi.org>
    > > Subject: Re: [OMPI users] MPI File Open does not work
    > > Message-ID: <2a933c0e-80f6-4ded-b44c-53b5f37ef...@cisco.com>
    > > Content-Type: text/plain; charset="us-ascii"
    > >
    > > On May 6, 2014, at 9:32 AM, Imran Ali
    > > <imra...@student.matnat.uio.no> wrote:
    > >
    > > > I will attempt that then. I read at
    > > >
    > > > http://www.open-mpi.org/faq/?category=building#install-overwrite
    > > >
    > > > that I should completely uninstall my previous version.
    > >
    > > Yes, that is best.  OR: you can install into a whole separate tree
    > > and ignore the first installation.
    > >
    > > > Could you recommend to me how I can go about doing it
    (without root
    > > access).
    > > > I am uncertain where I can use make uninstall.
    > >
    > > If you don't have write access into the installation tree (i.e., it
    > > was installed via root and you don't have root access), then your
    > > best bet is simply to install into a new tree.  E.g., if OMPI is
    > > installed into /opt/openmpi-1.6.2, try installing into
    > > /opt/openmpi-1.6.5, or even $HOME/installs/openmpi-1.6.5, or
    > > something like that.
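    > >
    > > A sketch of that recipe (the prefix path is illustrative):
    > >
    > >    ./configure --prefix=$HOME/installs/openmpi-1.6.5
    > >    make all install
    > >
    > > Then put $HOME/installs/openmpi-1.6.5/bin on your PATH and
    > > $HOME/installs/openmpi-1.6.5/lib on your LD_LIBRARY_PATH so the new
    > > tree is the one that gets used.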
    > >
    > > --
    > > Jeff Squyres
    > > jsquy...@cisco.com
    > > For corporate legal information go to:
    > > http://www.cisco.com/web/about/doing_business/legal/cri/
    > >
    > >
    > >
    > > ------------------------------
    > >
    > > Message: 7
    > > Date: Tue, 6 May 2014 15:40:34 +0200
    > > From: Imran Ali <imra...@student.matnat.uio.no>
    > > To: Open MPI Users <us...@open-mpi.org>
    > > Subject: Re: [OMPI users] MPI File Open does not work
    > > Message-ID: <14f0596c-c5c5-4b1a-a4a8-8849d44ab...@math.uio.no>
    > > Content-Type: text/plain; charset=us-ascii
    > >
    > >
    > > On 6 May 2014, at 15:34, Jeff Squyres (jsquyres)
    > > <jsquy...@cisco.com> wrote:
    > >
    > > > [previous message quoted in full; snipped]
    > >
    > > My install was in my user directory (i.e., $HOME). I managed to
    > > locate the source directory and successfully run make uninstall.
    > >
    > > Will let you know how things went after installation.
    > >
    > > Imran
    > >
    > >
    > >
    > >
    > > ------------------------------
    > >
    > > Message: 8
    > > Date: Tue, 6 May 2014 14:42:52 +0000
    > > From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
    > > To: Open MPI Users <us...@open-mpi.org>
    > > Subject: Re: [OMPI users] MPI File Open does not work
    > > Message-ID: <710e3328-edaa-4a13-9f07-b45fe3191...@cisco.com>
    > > Content-Type: text/plain; charset="us-ascii"
    > >
    > > On May 6, 2014, at 9:40 AM, Imran Ali
    > > <imra...@student.matnat.uio.no> wrote:
    > >
    > > > My install was in my user directory (i.e., $HOME). I managed to
    > > > locate the source directory and successfully run make uninstall.
    > >
    > >
    > > FWIW, I usually install Open MPI into its own subdir.  E.g.,
    > > $HOME/installs/openmpi-x.y.z.  Then if I don't want that install any
    > > more, I can just "rm -rf $HOME/installs/openmpi-x.y.z" -- no need to
    > > "make uninstall".  Specifically: if there's nothing else installed
    > > in the same tree as Open MPI, you can just rm -rf its installation
    > > tree.
    > >
    > > --
    > > Jeff Squyres
    > > jsquy...@cisco.com
    > > For corporate legal information go to:
    > > http://www.cisco.com/web/about/doing_business/legal/cri/
    > >
    > >
    > >
    > > ------------------------------
    > >
    > > Message: 9
    > > Date: Tue, 6 May 2014 14:50:34 +0000
    > > From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
    > > To: Open MPI Users <us...@open-mpi.org>
    > > Subject: Re: [OMPI users] users Digest, Vol 2879, Issue 1
    > > Message-ID: <c60aa7e1-96a7-4ccd-9b5b-11a38fb87...@cisco.com>
    > > Content-Type: text/plain; charset="us-ascii"
    > >
    > > Are you using TCP as the MPI transport?
    > >
    > > If so, another thing to try is to limit the IP interfaces that MPI
    > > uses for its traffic, to see if there's some kind of problem with
    > > specific networks.
    > >
    > > For example:
    > >
    > >    mpirun --mca btl_tcp_if_include eth0 ...
    > >
    > > If that works, then try adding in any/all other IP interfaces that
    > > you have on your machines.
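    > >
    > > For example (treating eth1 as a stand-in for whatever second
    > > interface you have):
    > >
    > >    mpirun --mca btl_tcp_if_include eth0,eth1 ...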
    > >
    > > A sorta-wild guess: you have some IP interfaces that aren't working,
    > > or at least, don't work in the way that OMPI wants them to work.  So
    > > the first barrier works because it flows across eth0 (or some other
    > > first network that, as far as OMPI is concerned, works just fine).
    > > But then the next barrier round-robin advances to the next IP
    > > interface, and it doesn't work for some reason.
    > >
    > > We've seen virtual machine bridge interfaces cause problems, for
    > > example.  E.g., if a machine has a Xen virtual machine interface
    > > (virbr0, IIRC?), then OMPI will try to use it to communicate with
    > > peer MPI processes, because it has a "compatible" IP address and
    > > OMPI thinks it should be connected/reachable to peers.  If this is
    > > the case, you might want to disable such interfaces and/or use
    > > btl_tcp_if_include or btl_tcp_if_exclude to select the interfaces
    > > that you want to use.
    > >
    > > Pro tip: if you use btl_tcp_if_exclude, remember to exclude the
    > > loopback interface, too.  OMPI defaults to btl_tcp_if_include=""
    > > (blank) and btl_tcp_if_exclude="lo0".  So if you override
    > > btl_tcp_if_exclude, you need to be sure to *also* include lo0 in the
    > > new value.  For example:
    > >
    > >    mpirun --mca btl_tcp_if_exclude lo0,virbr0 ...
    > >
    > > Also, if possible, try upgrading to Open MPI 1.8.1.
    > >
    > >
    > >
    > > On May 4, 2014, at 2:15 PM, Clay Kirkland
    > > <clay.kirkl...@versityinc.com> wrote:
    > >
    > > >  I am configuring with all defaults.   Just doing a ./configure
    > > > and then make and make install.   I have used open mpi on several
    > > > kinds of unix systems this way and have had no trouble before.
    > > > I believe I last had success on a redhat version of linux.
    > > >
    > > >
    > > > On Sat, May 3, 2014 at 11:00 AM, <users-requ...@open-mpi.org> wrote:
    > > >
    > > >
    > > > Today's Topics:
    > > >
    > > >    1. MPI_Barrier hangs on second attempt but only when multiple
    > > >       hosts used. (Clay Kirkland)
    > > >    2. Re: MPI_Barrier hangs on second attempt but only when
    > > >       multiple hosts used. (Ralph Castain)
    > > >
    > > >
    > > >
    ----------------------------------------------------------------------
    > > >
    > > > Message: 1
    > > > Date: Fri, 2 May 2014 16:24:04 -0500
    > > > From: Clay Kirkland <clay.kirkl...@versityinc.com>
    > > > To: us...@open-mpi.org
    > > > Subject: [OMPI users] MPI_Barrier hangs on second attempt but
    > > >         only when multiple hosts used.
    > > > Message-ID:
    > > >         <CAJDnjA8Wi=FEjz6Vz+Bc34b+nFE=tf4b7g0bqgmbekg7h-p...@mail.gmail.com>
    > > > Content-Type: text/plain; charset="utf-8"
    > > >
    > > > [body of the original MPI_Barrier posting snipped; it appears in
    > > > full earlier in this digest]
    > > >
    > > > ------------------------------
    > > >
    > > > Message: 2
    > > > Date: Sat, 3 May 2014 06:39:20 -0700
    > > > From: Ralph Castain <r...@open-mpi.org>
    > > > To: Open MPI Users <us...@open-mpi.org>
    > > > Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but
    > > >         only when multiple hosts used.
    > > > Message-ID: <3cf53d73-15d9-40bb-a2de-50ba3561a...@open-mpi.org>
    > > > Content-Type: text/plain; charset="us-ascii"
    > > >
    > > > Hmmm... just testing on my little cluster here on two nodes, it
    > > > works just fine with 1.8.2:
    > > >
    > > > [rhc@bend001 v1.8]$ mpirun -n 2 --map-by node ./a.out
    > > >  In rank 0 and host= bend001  Do Barrier call 1.
    > > >  In rank 0 and host= bend001  Do Barrier call 2.
    > > >  In rank 0 and host= bend001  Do Barrier call 3.
    > > >  In rank 1 and host= bend002  Do Barrier call 1.
    > > >  In rank 1 and host= bend002  Do Barrier call 2.
    > > >  In rank 1 and host= bend002  Do Barrier call 3.
    > > > [rhc@bend001 v1.8]$
    > > >
    > > >
    > > > How are you configuring OMPI?
    > > >
    > > >
    > > > On May 2, 2014, at 2:24 PM, Clay Kirkland
    > > > <clay.kirkl...@versityinc.com> wrote:
    > > >
    > > > > [original MPI_Barrier posting quoted in full; snipped]
    > > >
    > > > ------------------------------
    > > >
    > > > End of users Digest, Vol 2879, Issue 1
    > > > **************************************
    > >
    > >
    > > --
    > > Jeff Squyres
    > > jsquy...@cisco.com
    > > For corporate legal information go to:
    > > http://www.cisco.com/web/about/doing_business/legal/cri/
    > >
    > >
    > >
    > >
    > > ------------------------------
    > >
    > > End of users Digest, Vol 2881, Issue 1
    > > **************************************
    > >
    >
    > ------------------------------
    >
    > Message: 2
    > Date: Tue, 6 May 2014 18:50:50 -0500
    > From: Clay Kirkland <clay.kirkl...@versityinc.com>
    > To: us...@open-mpi.org
    > Subject: Re: [OMPI users] users Digest, Vol 2881, Issue 1
    > Message-ID:
    >   <cajdnja-u4btpto+87czsho81t+-a1jzottc7mwdfiar7+vz...@mail.gmail.com>
    > Content-Type: text/plain; charset="utf-8"
    >
    >  Well it turns out I can't seem to get all three of my machines on
    > the same page.   Two of them are using eth0 and one is using eth1.
    > Centos seems unable to bring up multiple network interfaces for some
    > reason, and when I use the mca param to use eth0 it works on two
    > machines but not the other.   Is there some way to use only eth1 on
    > one host and only eth0 on the other two?   Maybe environment
    > variables, but I can't seem to get that to work either.
    >
    >  Clay
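    >
    > (A sketch of one way to do that, untested here: MCA parameters can be
    > set per host, e.g. in each user's $HOME/.openmpi/mca-params.conf, so
    > the eth1 machine could carry
    >
    >    btl_tcp_if_include = eth1
    >    oob_tcp_if_include = eth1
    >
    > while the other two hosts list eth0. The environment-variable form
    > would be exporting OMPI_MCA_btl_tcp_if_include=eth1 on that host. The
    > interface assignments come from this thread; the rest is an
    > assumption.)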
    >
    >
    > On Tue, May 6, 2014 at 6:28 PM, Clay Kirkland
    > <clay.kirkl...@versityinc.com> wrote:
    >
    > > [previous message quoted in full; snipped]
    > >
    > > On Tue, May 6, 2014 at 11:00 AM, <users-requ...@open-mpi.org> wrote:
    > >
    > >>
    > >> Today's Topics:
    > >>
    > >>    1. Re: MPI_Barrier hangs on second attempt but only  when
    > >>       multiple hosts used. (Daniels, Marcus G)
    > >>    2. ROMIO bug reading darrays (Richard Shaw)
    > >>    3. MPI File Open does not work (Imran Ali)
    > >>    4. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
    > >>    5. Re: MPI File Open does not work (Imran Ali)
    > >>    6. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
    > >>    7. Re: MPI File Open does not work (Imran Ali)
    > >>    8. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
    > >>    9. Re: users Digest, Vol 2879, Issue 1 (Jeff Squyres
    (jsquyres))
    > >>
    > >>
    > >>
    ----------------------------------------------------------------------
    > >>
    > >> Message: 1
    > >> Date: Mon, 5 May 2014 19:28:07 +0000
    > >> From: "Daniels, Marcus G" <mdani...@lanl.gov>
    > >> To: "'us...@open-mpi.org'" <us...@open-mpi.org>
    > >> Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but only
    > >>         when multiple hosts used.
    > >> Message-ID: <532c594b7920a549a2a91cb4312cc57640dc5...@ecs-exg-p-mb01.win.lanl.gov>
    > >> Content-Type: text/plain; charset="utf-8"
    > >>
    > >>
    > >>
    > >> From: Clay Kirkland [mailto:clay.kirkl...@versityinc.com]
    > >> Sent: Friday, May 02, 2014 03:24 PM
    > >> To: us...@open-mpi.org
    > >> Subject: [OMPI users] MPI_Barrier hangs on second attempt but only
    > >>         when multiple hosts used.
    > >>
    > >> I have been using MPI for many many years, so I have very well debugged
    > >> mpi tests.   I am having trouble on either openmpi-1.4.5 or openmpi-1.6.5
    > >> versions, though, with getting the MPI_Barrier calls to work.   It works
    > >> fine when I run all processes on one machine, but when I run with two or
    > >> more hosts the second call to MPI_Barrier always hangs.   Not the first
    > >> one, but always the second one.   I looked at FAQ's and such but found
    > >> nothing except for a comment that MPI_Barrier problems were often
    > >> problems with firewalls.  Also mentioned as a problem was not having the
    > >> same version of mpi on both machines.  I turned firewalls off and
    > >> removed and reinstalled the same version on both hosts, but I still see
    > >> the same thing.   I then installed lam mpi on two of my machines and
    > >> that works fine.   I can call the MPI_Barrier function many times with
    > >> no hangs when run on one of the two machines by itself.   It only hangs
    > >> if two or more hosts are involved.   These runs are all being done on
    > >> CentOS release 6.4.   Here is the test program I used.
    > >>
    > >> #include <mpi.h>
    > >> #include <stdio.h>
    > >> #include <stdlib.h>
    > >> #include <unistd.h>
    > >>
    > >> int main (int argc, char **argv)
    > >> {
    > >>     char hoster[256];
    > >>     int myrank, np;
    > >>
    > >>     MPI_Init( &argc, &argv );
    > >>     MPI_Comm_rank( MPI_COMM_WORLD, &myrank);
    > >>     MPI_Comm_size( MPI_COMM_WORLD, &np);
    > >>
    > >>     gethostname(hoster,256);
    > >>
    > >>     printf(" In rank %d and host= %s  Do Barrier call 1.\n",myrank,hoster);
    > >>     MPI_Barrier(MPI_COMM_WORLD);
    > >>     printf(" In rank %d and host= %s  Do Barrier call 2.\n",myrank,hoster);
    > >>     MPI_Barrier(MPI_COMM_WORLD);
    > >>     printf(" In rank %d and host= %s  Do Barrier call 3.\n",myrank,hoster);
    > >>     MPI_Barrier(MPI_COMM_WORLD);
    > >>     MPI_Finalize();
    > >>     exit(0);
    > >> }
    > >>
    > >>   Here are three runs of the test program: first with two processes on
    > >> one host, then with two processes on another host, and finally with one
    > >> process on each of two hosts.  The first two runs are fine, but the last
    > >> run hangs on the second MPI_Barrier.
    > >>
    > >> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos
    a.out
    > >>  In rank 0 and host= centos  Do Barrier call 1.
    > >>  In rank 1 and host= centos  Do Barrier call 1.
    > >>  In rank 1 and host= centos  Do Barrier call 2.
    > >>  In rank 1 and host= centos  Do Barrier call 3.
    > >>  In rank 0 and host= centos  Do Barrier call 2.
    > >>  In rank 0 and host= centos  Do Barrier call 3.
    > >> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out
    > >> /root/.bashrc: line 14: unalias: ls: not found
    > >>  In rank 0 and host= RAID  Do Barrier call 1.
    > >>  In rank 0 and host= RAID  Do Barrier call 2.
    > >>  In rank 0 and host= RAID  Do Barrier call 3.
    > >>  In rank 1 and host= RAID  Do Barrier call 1.
    > >>  In rank 1 and host= RAID  Do Barrier call 2.
    > >>  In rank 1 and host= RAID  Do Barrier call 3.
    > >> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host
    centos,RAID a.out
    > >> /root/.bashrc: line 14: unalias: ls: not found
    > >>  In rank 0 and host= centos  Do Barrier call 1.
    > >>  In rank 0 and host= centos  Do Barrier call 2.
    > >> In rank 1 and host= RAID  Do Barrier call 1.
    > >>  In rank 1 and host= RAID  Do Barrier call 2.
    > >>
    > >>   Since it is such a simple test and problem, and such a widely used
    > >> MPI function, it must obviously be an installation or configuration
    > >> problem.   A pstack for each of the hung MPI_Barrier processes on the
    > >> two machines shows this:
    > >>
    > >> [root@centos ~]# pstack 31666
    > >> #0  0x0000003baf0e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
    > >> #1  0x00007f5de06125eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
    > >> #2  0x00007f5de061475a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
    > >> #3  0x00007f5de0639229 in opal_progress () from /usr/local/lib/libmpi.so.1
    > >> #4  0x00007f5de0586f75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
    > >> #5  0x00007f5ddc59565e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
    > >> #6  0x00007f5ddc59d8ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
    > >> #7  0x00007f5de05941c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
    > >> #8  0x0000000000400a43 in main ()
    > >>
    > >> [root@RAID openmpi-1.6.5]# pstack 22167
    > >> #0  0x00000030302e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
    > >> #1  0x00007f7ee46885eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
    > >> #2  0x00007f7ee468a75a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
    > >> #3  0x00007f7ee46af229 in opal_progress () from /usr/local/lib/libmpi.so.1
    > >> #4  0x00007f7ee45fcf75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
    > >> #5  0x00007f7ee060b65e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
    > >> #6  0x00007f7ee06138ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
    > >> #7  0x00007f7ee460a1c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
    > >> #8  0x0000000000400a43 in main ()
    > >>
    > >>  Which looks exactly the same on each machine.  Any thoughts or ideas
    > >> would be greatly appreciated, as I am stuck.
    > >>
    > >>  Clay Kirkland
    > >>
    > >>
    > >> ------------------------------
    > >>
    > >> Message: 2
    > >> Date: Mon, 5 May 2014 22:20:59 -0400
    > >> From: Richard Shaw <jr...@cita.utoronto.ca>
    > >> To: Open MPI Users <us...@open-mpi.org>
    > >> Subject: [OMPI users] ROMIO bug reading darrays
    > >> Message-ID: <can+evmkc+9kacnpausscziufwdj3jfcsymb-8zdx1etdkab...@mail.gmail.com>
    > >> Content-Type: text/plain; charset="utf-8"
    > >>
    > >> Hello,
    > >>
    > >> I think I've come across a bug when using ROMIO to read in a 2D
    > >> distributed array.  I've attached a test case to this email.
    > >>
    > >> In the testcase I first initialise an array of 25 doubles (which will be
    > >> a 5x5 grid), then I create a datatype representing a 5x5 matrix
    > >> distributed in 3x3 blocks over a 2x2 process grid.  As a reference I use
    > >> MPI_Pack to pull out the block-cyclic array elements local to each
    > >> process (which I think is correct).  Then I write the original array of
    > >> 25 doubles to disk, and use MPI-IO to read it back in (performing the
    > >> Open, Set_view, and Read_all), and compare to the reference.
    > >>
    > >> Running this with OMPIO, the two match on all ranks.
    > >>
    > >> > mpirun -mca io ompio -np 4 ./darr_read.x
    > >> === Rank 0 === (9 elements)
    > >> Packed:  0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
    > >> Read:    0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
    > >>
    > >> === Rank 1 === (6 elements)
    > >> Packed: 15.0 16.0 17.0 20.0 21.0 22.0
    > >> Read:   15.0 16.0 17.0 20.0 21.0 22.0
    > >>
    > >> === Rank 2 === (6 elements)
    > >> Packed:  3.0  4.0  8.0  9.0 13.0 14.0
    > >> Read:    3.0  4.0  8.0  9.0 13.0 14.0
    > >>
    > >> === Rank 3 === (4 elements)
    > >> Packed: 18.0 19.0 23.0 24.0
    > >> Read:   18.0 19.0 23.0 24.0
    > >>
    > >>
    > >>
    > >> However, using ROMIO the two differ on two of the ranks:
    > >>
    > >> > mpirun -mca io romio -np 4 ./darr_read.x
    > >> === Rank 0 === (9 elements)
    > >> Packed:  0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
    > >> Read:    0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
    > >>
    > >> === Rank 1 === (6 elements)
    > >> Packed: 15.0 16.0 17.0 20.0 21.0 22.0
    > >> Read:    0.0  1.0  2.0  0.0  1.0  2.0
    > >>
    > >> === Rank 2 === (6 elements)
    > >> Packed:  3.0  4.0  8.0  9.0 13.0 14.0
    > >> Read:    3.0  4.0  8.0  9.0 13.0 14.0
    > >>
    > >> === Rank 3 === (4 elements)
    > >> Packed: 18.0 19.0 23.0 24.0
    > >> Read:    0.0  1.0  0.0  1.0
    > >>
    > >>
    > >>
    > >> My interpretation is that the behaviour with OMPIO is correct.
    > >> Interestingly, everything matches up using both ROMIO and OMPIO if I set
    > >> the block shape to 2x2.
    > >>
    > >> This was run on OS X using 1.8.2a1r31632.  I have also run this on Linux
    > >> with OpenMPI 1.7.4, and OMPIO is still correct, but using ROMIO I just
    > >> get segfaults.
    > >>
    > >> Thanks,
    > >> Richard
    > >> -------------- next part --------------
    > >> A non-text attachment was scrubbed...
    > >> Name: darr_read.c
    > >> Type: text/x-csrc
    > >> Size: 2218 bytes
    > >> URL: <http://www.open-mpi.org/MailArchives/users/attachments/20140505/5a5ab0ba/attachment.bin>
    > >>
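    > >> For reference, the read path described above boils down to something
    > >> like the following minimal sketch.  This is not the attached darr_read.c
    > >> itself; the filename "darr.dat" and the unchecked return codes are
    > >> assumptions made for brevity:
    > >>
    > >> #include <mpi.h>
    > >> #include <stdio.h>
    > >>
    > >> int main(int argc, char **argv)
    > >> {
    > >>     MPI_Init(&argc, &argv);
    > >>
    > >>     int rank;
    > >>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    > >>
    > >>     /* 5x5 global array of doubles, block-cyclic in 3x3 blocks over a
    > >>        2x2 process grid -- run with exactly 4 ranks */
    > >>     int gsizes[2]   = {5, 5};
    > >>     int distribs[2] = {MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC};
    > >>     int dargs[2]    = {3, 3};
    > >>     int psizes[2]   = {2, 2};
    > >>
    > >>     MPI_Datatype darray;
    > >>     MPI_Type_create_darray(4, rank, 2, gsizes, distribs, dargs, psizes,
    > >>                            MPI_ORDER_C, MPI_DOUBLE, &darray);
    > >>     MPI_Type_commit(&darray);
    > >>
    > >>     /* open the file written earlier and apply the darray as the view */
    > >>     MPI_File fh;
    > >>     MPI_File_open(MPI_COMM_WORLD, "darr.dat", MPI_MODE_RDONLY,
    > >>                   MPI_INFO_NULL, &fh);
    > >>     MPI_File_set_view(fh, 0, MPI_DOUBLE, darray, "native", MPI_INFO_NULL);
    > >>
    > >>     /* each rank's share of the array is the size of its darray type */
    > >>     int nbytes;
    > >>     MPI_Type_size(darray, &nbytes);
    > >>     int count = nbytes / (int) sizeof(double);
    > >>
    > >>     double local[25];   /* 25 doubles is enough for any rank */
    > >>     MPI_File_read_all(fh, local, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
    > >>
    > >>     for (int i = 0; i < count; i++)
    > >>         printf("rank %d: %4.1f\n", rank, local[i]);
    > >>
    > >>     MPI_File_close(&fh);
    > >>     MPI_Type_free(&darray);
    > >>     MPI_Finalize();
    > >>     return 0;
    > >> }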
    > >> ------------------------------
    > >>
    > >> Message: 3
    > >> Date: Tue, 06 May 2014 13:24:35 +0200
    > >> From: Imran Ali <imra...@student.matnat.uio.no>
    > >> To: <us...@open-mpi.org>
    > >> Subject: [OMPI users] MPI File Open does not work
    > >> Message-ID: <d57bdf499c00360b737205b085c50...@ulrik.uio.no>
    > >> Content-Type: text/plain; charset="utf-8"
    > >>
    > >>
    > >>
    > >> I get the following error when I try to run the following python code:
    > >>
    > >> import mpi4py.MPI as MPI
    > >> comm = MPI.COMM_WORLD
    > >> MPI.File.Open(comm,"some.file")
    > >>
    > >> $ mpirun -np 1 python test_mpi.py
    > >> Traceback (most recent call last):
    > >>   File "test_mpi.py", line 3, in <module>
    > >>     MPI.File.Open(comm," h5ex_d_alloc.h5")
    > >>   File "File.pyx", line 67, in mpi4py.MPI.File.Open (src/mpi4py.MPI.c:89639)
    > >> mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
    > >> --------------------------------------------------------------------------
    > >> mpirun noticed that the job aborted, but has no info as to the process
    > >> that caused that situation.
    > >> --------------------------------------------------------------------------
    > >>
    > >> My mpirun version is (Open MPI) 1.6.2.  I installed openmpi using the
    > >> dorsal script (https://github.com/FEniCS/dorsal) for Redhat Enterprise 6
    > >> (the OS I am using, release 6.5).  It configured the build as follows:
    > >>
    > >> ./configure --enable-mpi-thread-multiple --enable-opal-multi-threads
    > >> --with-threads=posix --disable-mpi-profile
    > >>
    > >> I need to emphasize that I do not have root access on the system I am
    > >> running my application.
    > >>
    > >> Imran
    > >>
    > >>
    > >> ------------------------------
    > >>
    > >> Message: 4
    > >> Date: Tue, 6 May 2014 12:56:04 +0000
    > >> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
    > >> To: Open MPI Users <us...@open-mpi.org>
    > >> Subject: Re: [OMPI users] MPI File Open does not work
    > >> Message-ID: <e7df28cb-d4fb-4087-928e-18e61d1d2...@cisco.com>
    > >> Content-Type: text/plain; charset="us-ascii"
    > >>
    > >> The thread support in the 1.6 series is not very good.  You
    might try:
    > >>
    > >> - Upgrading to 1.6.5
    > >> - Or better yet, upgrading to 1.8.1
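    > >> One quick way to see which version and what thread support a given build
    > >> actually has is ompi_info from that installation -- a sketch, with the
    > >> grep patterns purely illustrative:
    > >>
    > >>    ompi_info | grep "Open MPI:"
    > >>    ompi_info | grep -i thread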
    > >>
    > >>
    > >> --
    > >> Jeff Squyres
    > >> jsquy...@cisco.com <mailto:jsquy...@cisco.com>
    > >> For corporate legal information go to:
    > >> http://www.cisco.com/web/about/doing_business/legal/cri/
    > >>
    > >>
    > >>
    > >> ------------------------------
    > >>
    > >> Message: 5
    > >> Date: Tue, 6 May 2014 15:32:21 +0200
    > >> From: Imran Ali <imra...@student.matnat.uio.no>
    > >> To: Open MPI Users <us...@open-mpi.org>
    > >> Subject: Re: [OMPI users] MPI File Open does not work
    > >> Message-ID: <fa6dffff-6c66-4a47-84fc-148fb51ce...@math.uio.no>
    > >> Content-Type: text/plain; charset=us-ascii
    > >>
    > >>
    > >> On 6 May 2014, at 14:56, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
    > >>
    > >> > The thread support in the 1.6 series is not very good.  You might try:
    > >> >
    > >> > - Upgrading to 1.6.5
    > >> > - Or better yet, upgrading to 1.8.1
    > >> >
    > >>
    > >> I will attempt that, then.  I read at
    > >>
    > >> http://www.open-mpi.org/faq/?category=building#install-overwrite
    > >>
    > >> that I should completely uninstall my previous version.  Could you
    > >> recommend to me how I can go about doing it (without root access)?
    > >> I am uncertain where I can run make uninstall.
    > >>
    > >> Imran
    > >>
    > >>
    > >> ------------------------------
    > >>
    > >> Message: 6
    > >> Date: Tue, 6 May 2014 13:34:38 +0000
    > >> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
    > >> To: Open MPI Users <us...@open-mpi.org>
    > >> Subject: Re: [OMPI users] MPI File Open does not work
    > >> Message-ID: <2a933c0e-80f6-4ded-b44c-53b5f37ef...@cisco.com>
    > >> Content-Type: text/plain; charset="us-ascii"
    > >>
    > >> On May 6, 2014, at 9:32 AM, Imran Ali <imra...@student.matnat.uio.no> wrote:
    > >>
    > >> > I will attempt that, then.  I read at
    > >> >
    > >> > http://www.open-mpi.org/faq/?category=building#install-overwrite
    > >> >
    > >> > that I should completely uninstall my previous version.
    > >>
    > >> Yes, that is best.  OR: you can install into a whole separate tree and
    > >> ignore the first installation.
    > >>
    > >> > Could you recommend to me how I can go about doing it (without root
    > >> > access)?
    > >> > I am uncertain where I can run make uninstall.
    > >>
    > >> If you don't have write access into the installation tree (i.e., it was
    > >> installed via root and you don't have root access), then your best bet is
    > >> simply to install into a new tree.  E.g., if OMPI is installed into
    > >> /opt/openmpi-1.6.2, try installing into /opt/openmpi-1.6.5, or even
    > >> $HOME/installs/openmpi-1.6.5, or something like that.
    > >>
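    > >> For a from-source build in your own directory, that could look roughly
    > >> like this (version number and paths are just examples):
    > >>
    > >>    ./configure --prefix=$HOME/installs/openmpi-1.6.5
    > >>    make all install
    > >>
    > >>    # then put the new install first in your environment:
    > >>    export PATH=$HOME/installs/openmpi-1.6.5/bin:$PATH
    > >>    export LD_LIBRARY_PATH=$HOME/installs/openmpi-1.6.5/lib:$LD_LIBRARY_PATH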
    > >>
    > >>
    > >>
    > >> ------------------------------
    > >>
    > >> Message: 7
    > >> Date: Tue, 6 May 2014 15:40:34 +0200
    > >> From: Imran Ali <imra...@student.matnat.uio.no>
    > >> To: Open MPI Users <us...@open-mpi.org>
    > >> Subject: Re: [OMPI users] MPI File Open does not work
    > >> Message-ID: <14f0596c-c5c5-4b1a-a4a8-8849d44ab...@math.uio.no>
    > >> Content-Type: text/plain; charset=us-ascii
    > >>
    > >>
    > >> On 6 May 2014, at 15:34, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
    > >>
    > >> > If you don't have write access into the installation tree (i.e., it was
    > >> > installed via root and you don't have root access), then your best bet
    > >> > is simply to install into a new tree.
    > >>
    > >> My install was in my user directory (i.e. $HOME).  I managed to locate
    > >> the source directory and successfully run make uninstall.
    > >>
    > >> Will let you know how things went after installation.
    > >>
    > >> Imran
    > >>
    > >>
    > >>
    > >>
    > >> ------------------------------
    > >>
    > >> Message: 8
    > >> Date: Tue, 6 May 2014 14:42:52 +0000
    > >> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
    > >> To: Open MPI Users <us...@open-mpi.org>
    > >> Subject: Re: [OMPI users] MPI File Open does not work
    > >> Message-ID: <710e3328-edaa-4a13-9f07-b45fe3191...@cisco.com>
    > >> Content-Type: text/plain; charset="us-ascii"
    > >>
    > >> On May 6, 2014, at 9:40 AM, Imran Ali <imra...@student.matnat.uio.no> wrote:
    > >>
    > >> > My install was in my user directory (i.e. $HOME).  I managed to locate
    > >> > the source directory and successfully run make uninstall.
    > >>
    > >> FWIW, I usually install Open MPI into its own subdir, e.g.,
    > >> $HOME/installs/openmpi-x.y.z.  Then if I don't want that install any
    > >> more, I can just "rm -rf $HOME/installs/openmpi-x.y.z" -- no need to
    > >> "make uninstall".  Specifically: if there's nothing else installed in
    > >> the same tree as Open MPI, you can just rm -rf its installation tree.
    > >>
    > >>
    > >>
    > >>
    > >> ------------------------------
    > >>
    > >> Message: 9
    > >> Date: Tue, 6 May 2014 14:50:34 +0000
    > >> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
    > >> To: Open MPI Users <us...@open-mpi.org>
    > >> Subject: Re: [OMPI users] users Digest, Vol 2879, Issue 1
    > >> Message-ID: <c60aa7e1-96a7-4ccd-9b5b-11a38fb87...@cisco.com>
    > >> Content-Type: text/plain; charset="us-ascii"
    > >>
    > >> Are you using TCP as the MPI transport?
    > >>
    > >> If so, another thing to try is to limit the IP interfaces that MPI uses
    > >> for its traffic to see if there's some kind of problem with specific
    > >> networks.
    > >>
    > >> For example:
    > >>
    > >>    mpirun --mca btl_tcp_if_include eth0 ...
    > >>
    > >> If that works, then try adding in any/all other IP interfaces that you
    > >> have on your machines.
    > >>
    > >> A sorta-wild guess: you have some IP interfaces that aren't working, or
    > >> at least don't work in the way that OMPI wants them to work.  So the
    > >> first barrier works because it flows across eth0 (or some other first
    > >> network that, as far as OMPI is concerned, works just fine).  But then
    > >> the next barrier round-robin advances to the next IP interface, and it
    > >> doesn't work for some reason.
    > >>
    > >> We've seen virtual machine bridge interfaces cause problems, for
    > >> example.  E.g., if a machine has a Xen virtual machine interface
    > >> (virbr0, IIRC?), then OMPI will try to use it to communicate with peer
    > >> MPI processes because it has a "compatible" IP address, and OMPI thinks
    > >> it should be connected/reachable to peers.  If this is the case, you
    > >> might want to disable such interfaces and/or use btl_tcp_if_include or
    > >> btl_tcp_if_exclude to select the interfaces that you want to use.
    > >>
    > >> Pro tip: if you use btl_tcp_if_exclude, remember to exclude the loopback
    > >> interface, too.  OMPI defaults to btl_tcp_if_include="" (blank) and
    > >> btl_tcp_if_exclude="lo0", so if you override btl_tcp_if_exclude, you
    > >> need to be sure to *also* include lo0 in the new value.  For example:
    > >>
    > >>    mpirun --mca btl_tcp_if_exclude lo0,virbr0 ...
    > >>
    > >> Also, if possible, try upgrading to Open MPI 1.8.1.
    > >>
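    > >> If different hosts need different values (e.g., eth0 on two hosts and
    > >> eth1 on a third), the parameter doesn't have to go on the mpirun command
    > >> line at all: each host can carry its own default in
    > >> $HOME/.openmpi/mca-params.conf.  A minimal sketch, assuming home
    > >> directories are not shared across the hosts:
    > >>
    > >>    # $HOME/.openmpi/mca-params.conf on the two eth0 hosts
    > >>    btl_tcp_if_include = eth0
    > >>
    > >>    # $HOME/.openmpi/mca-params.conf on the eth1 host
    > >>    btl_tcp_if_include = eth1
    > >>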
    > >>
    > >>
    > >> On May 4, 2014, at 2:15 PM, Clay Kirkland <clay.kirkl...@versityinc.com> wrote:
    > >>
    > >> >  I am configuring with all defaults.   Just doing a ./configure and
    > >> > then make and make install.   I have used open mpi on several kinds of
    > >> > unix systems this way and have had no trouble before.   I believe I
    > >> > last had success on a redhat version of linux.
    > >> >
    > >> >
    > >> > On Sat, May 3, 2014 at 11:00 AM, <users-requ...@open-mpi.org> wrote:
    > >> > ------------------------------
    > >> >
    > >> > Message: 2
    > >> > Date: Sat, 3 May 2014 06:39:20 -0700
    > >> > From: Ralph Castain <r...@open-mpi.org>
    > >> > To: Open MPI Users <us...@open-mpi.org>
    > >> > Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but only
    > >> >         when multiple hosts used.
    > >> > Message-ID: <3cf53d73-15d9-40bb-a2de-50ba3561a...@open-mpi.org>
    > >> > Content-Type: text/plain; charset="us-ascii"
    > >> >
    > >> > Hmmm... just testing on my little cluster here on two nodes, it works
    > >> > just fine with 1.8.2:
    > >> >
    > >> > [rhc@bend001 v1.8]$ mpirun -n 2 --map-by node ./a.out
    > >> >  In rank 0 and host= bend001  Do Barrier call 1.
    > >> >  In rank 0 and host= bend001  Do Barrier call 2.
    > >> >  In rank 0 and host= bend001  Do Barrier call 3.
    > >> >  In rank 1 and host= bend002  Do Barrier call 1.
    > >> >  In rank 1 and host= bend002  Do Barrier call 2.
    > >> >  In rank 1 and host= bend002  Do Barrier call 3.
    > >> > [rhc@bend001 v1.8]$
    > >> >
    > >> >
    > >> > How are you configuring OMPI?