Well, it turns out I can't seem to get all three of my machines on the
same page.  Two of them use eth0 and one uses eth1.  CentOS seems unable
to bring up multiple network interfaces for some reason, and when I use
the MCA parameter to select eth0 it works on two machines but not the
third.  Is there some way to use only eth1 on one host and only eth0 on
the other two?  Maybe environment variables, but I can't seem to get
that to work either.
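
For concreteness, here is the kind of thing I am after.  This is an
untested sketch, and the 192.168.1.0/24 subnet and the host names are
just placeholders for my setup.  Open MPI can select TCP interfaces by
IP subnet rather than by interface name, so if all three hosts sit on
the same network, something like

   mpirun --mca btl_tcp_if_include 192.168.1.0/24 \
          -np 3 --host hostA,hostB,hostC a.out

should end up using eth0 on the two hosts and eth1 on the third,
whichever interface actually carries that subnet.  The environment
variable idea was to set the same MCA parameter per host, for example
putting

   export OMPI_MCA_btl_tcp_if_include=eth1

in the shell startup file on the eth1 host and eth0 on the other two,
since MCA parameters can also be read from OMPI_MCA_* environment
variables on each node, but so far I have not gotten that variant to
work.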

 Clay


On Tue, May 6, 2014 at 6:28 PM, Clay Kirkland
<clay.kirkl...@versityinc.com> wrote:

>  That last trick seems to work.  I can get it to work once in a while
> with those tcp options, but it is tricky, as I have three machines and
> two of them use eth0 as the primary network interface while one uses
> eth1.  By fiddling with network options, and perhaps moving a cable or
> two, I think I can get it all to work.  Thanks much for the tip.
>
>  Clay
>
>
> On Tue, May 6, 2014 at 11:00 AM, <users-requ...@open-mpi.org> wrote:
>
>> Send users mailing list submissions to
>>         us...@open-mpi.org
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>>         http://www.open-mpi.org/mailman/listinfo.cgi/users
>> or, via email, send a message with subject or body 'help' to
>>         users-requ...@open-mpi.org
>>
>> You can reach the person managing the list at
>>         users-ow...@open-mpi.org
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of users digest..."
>>
>>
>> Today's Topics:
>>
>>    1. Re: MPI_Barrier hangs on second attempt but only  when
>>       multiple hosts used. (Daniels, Marcus G)
>>    2. ROMIO bug reading darrays (Richard Shaw)
>>    3. MPI File Open does not work (Imran Ali)
>>    4. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
>>    5. Re: MPI File Open does not work (Imran Ali)
>>    6. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
>>    7. Re: MPI File Open does not work (Imran Ali)
>>    8. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
>>    9. Re: users Digest, Vol 2879, Issue 1 (Jeff Squyres (jsquyres))
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Mon, 5 May 2014 19:28:07 +0000
>> From: "Daniels, Marcus G" <mdani...@lanl.gov>
>> To: "'us...@open-mpi.org'" <us...@open-mpi.org>
>> Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but only
>>         when    multiple hosts used.
>> Message-ID:
>>         <
>> 532c594b7920a549a2a91cb4312cc57640dc5...@ecs-exg-p-mb01.win.lanl.gov>
>> Content-Type: text/plain; charset="utf-8"
>>
>>
>>
>> From: Clay Kirkland [mailto:clay.kirkl...@versityinc.com]
>> Sent: Friday, May 02, 2014 03:24 PM
>> To: us...@open-mpi.org <us...@open-mpi.org>
>> Subject: [OMPI users] MPI_Barrier hangs on second attempt but only when
>> multiple hosts used.
>>
>> I have been using MPI for many many years so I have very well debugged
>> mpi tests.   I am
>> having trouble on either openmpi-1.4.5  or  openmpi-1.6.5 versions though
>> with getting the
>> MPI_Barrier calls to work.   It works fine when I run all processes on
>> one machine but when
>> I run with two or more hosts the second call to MPI_Barrier always hangs.
>>   Not the first one,
>> but always the second one.   I looked at FAQ's and such but found nothing
>> except for a comment
>> that MPI_Barrier problems were often problems with fire walls.  Also
>> mentioned as a problem
>> was not having the same version of mpi on both machines.  I turned
>> firewalls off and removed
>> and reinstalled the same version on both hosts but I still see the same
>> thing.   I then installed
>> lam mpi on two of my machines and that works fine.   I can call the
>> MPI_Barrier function when run on
>> one of two machines by itself  many times with no hangs.  Only hangs if
>> two or more hosts are involved.
>> These runs are all being done on CentOS release 6.4.   Here is test
>> program I used.
>>
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>>
>> int main (int argc, char **argv)
>> {
>>     char message[20];
>>     char hoster[256];
>>     char nameis[256];
>>     int fd, i, j, jnp, iret, myrank, np, ranker, recker;
>>     MPI_Comm comm;
>>     MPI_Status status;
>>
>>     MPI_Init( &argc, &argv );
>>     MPI_Comm_rank( MPI_COMM_WORLD, &myrank);
>>     MPI_Comm_size( MPI_COMM_WORLD, &np);
>>
>>     gethostname(hoster,256);
>>
>>     printf(" In rank %d and host= %s  Do Barrier call 1.\n",myrank,hoster);
>>     MPI_Barrier(MPI_COMM_WORLD);
>>     printf(" In rank %d and host= %s  Do Barrier call 2.\n",myrank,hoster);
>>     MPI_Barrier(MPI_COMM_WORLD);
>>     printf(" In rank %d and host= %s  Do Barrier call 3.\n",myrank,hoster);
>>     MPI_Barrier(MPI_COMM_WORLD);
>>     MPI_Finalize();
>>     exit(0);
>> }
>>
>>   Here are three runs of test program.  First with two processes on one
>> host, then with
>> two processes on another host, and finally with one process on each of
>> two hosts.  The
>> first two runs are fine but the last run hangs on the second MPI_Barrier.
>>
>> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos a.out
>>  In rank 0 and host= centos  Do Barrier call 1.
>>  In rank 1 and host= centos  Do Barrier call 1.
>>  In rank 1 and host= centos  Do Barrier call 2.
>>  In rank 1 and host= centos  Do Barrier call 3.
>>  In rank 0 and host= centos  Do Barrier call 2.
>>  In rank 0 and host= centos  Do Barrier call 3.
>> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out
>> /root/.bashrc: line 14: unalias: ls: not found
>>  In rank 0 and host= RAID  Do Barrier call 1.
>>  In rank 0 and host= RAID  Do Barrier call 2.
>>  In rank 0 and host= RAID  Do Barrier call 3.
>>  In rank 1 and host= RAID  Do Barrier call 1.
>>  In rank 1 and host= RAID  Do Barrier call 2.
>>  In rank 1 and host= RAID  Do Barrier call 3.
>> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos,RAID a.out
>> /root/.bashrc: line 14: unalias: ls: not found
>>  In rank 0 and host= centos  Do Barrier call 1.
>>  In rank 0 and host= centos  Do Barrier call 2.
>> In rank 1 and host= RAID  Do Barrier call 1.
>>  In rank 1 and host= RAID  Do Barrier call 2.
>>
>>   Since it is such a simple test and problem and such a widely used MPI
>> function, it must obviously
>> be an installation or configuration problem.   A pstack for each of the
>> hung MPI_Barrier processes
>> on the two machines shows this:
>>
>> [root@centos ~]# pstack 31666
>> #0  0x0000003baf0e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
>> #1  0x00007f5de06125eb in epoll_dispatch () from
>> /usr/local/lib/libmpi.so.1
>> #2  0x00007f5de061475a in opal_event_base_loop () from
>> /usr/local/lib/libmpi.so.1
>> #3  0x00007f5de0639229 in opal_progress () from /usr/local/lib/libmpi.so.1
>> #4  0x00007f5de0586f75 in ompi_request_default_wait_all () from
>> /usr/local/lib/libmpi.so.1
>> #5  0x00007f5ddc59565e in ompi_coll_tuned_sendrecv_actual () from
>> /usr/local/lib/openmpi/mca_coll_tuned.so
>> #6  0x00007f5ddc59d8ff in ompi_coll_tuned_barrier_intra_two_procs () from
>> /usr/local/lib/openmpi/mca_coll_tuned.so
>> #7  0x00007f5de05941c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
>> #8  0x0000000000400a43 in main ()
>>
>> [root@RAID openmpi-1.6.5]# pstack 22167
>> #0  0x00000030302e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
>> #1  0x00007f7ee46885eb in epoll_dispatch () from
>> /usr/local/lib/libmpi.so.1
>> #2  0x00007f7ee468a75a in opal_event_base_loop () from
>> /usr/local/lib/libmpi.so.1
>> #3  0x00007f7ee46af229 in opal_progress () from /usr/local/lib/libmpi.so.1
>> #4  0x00007f7ee45fcf75 in ompi_request_default_wait_all () from
>> /usr/local/lib/libmpi.so.1
>> #5  0x00007f7ee060b65e in ompi_coll_tuned_sendrecv_actual () from
>> /usr/local/lib/openmpi/mca_coll_tuned.so
>> #6  0x00007f7ee06138ff in ompi_coll_tuned_barrier_intra_two_procs () from
>> /usr/local/lib/openmpi/mca_coll_tuned.so
>> #7  0x00007f7ee460a1c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
>> #8  0x0000000000400a43 in main ()
>>
>>  Which looks exactly the same on each machine.  Any thoughts or ideas
>> would be greatly appreciated as
>> I am stuck.
>>
>>  Clay Kirkland
>>
>>
>>
>>
>>
>>
>>
>>
>> -------------- next part --------------
>> HTML attachment scrubbed and removed
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Mon, 5 May 2014 22:20:59 -0400
>> From: Richard Shaw <jr...@cita.utoronto.ca>
>> To: Open MPI Users <us...@open-mpi.org>
>> Subject: [OMPI users] ROMIO bug reading darrays
>> Message-ID:
>>         <
>> can+evmkc+9kacnpausscziufwdj3jfcsymb-8zdx1etdkab...@mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hello,
>>
>> I think I've come across a bug when using ROMIO to read in a 2D
>> distributed
>> array. I've attached a test case to this email.
>>
>> In the testcase I first initialise an array of 25 doubles (which will be a
>> 5x5 grid), then I create a datatype representing a 5x5 matrix distributed
>> in 3x3 blocks over a 2x2 process grid. As a reference I use MPI_Pack to
>> pull out the block cyclic array elements local to each process (which I
>> think is correct). Then I write the original array of 25 doubles to disk,
>> and use MPI-IO to read it back in (performing the Open, Set_view, and
>> Read_all), and compare to the reference.
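>>
>> For readers without the attachment, a condensed sketch of the read
>> side of that pattern is below.  This is not the attached darr_read.c;
>> the file name is a placeholder, and the sizes follow the description
>> above (5x5 doubles, 3x3 blocks, 2x2 process grid, so exactly 4 ranks):
>>
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int main(int argc, char **argv)
>> {
>>     int gsizes[2]   = {5, 5};
>>     int distribs[2] = {MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC};
>>     int dargs[2]    = {3, 3};              /* 3x3 blocks       */
>>     int psizes[2]   = {2, 2};              /* 2x2 process grid */
>>     int rank, nprocs, locsize;
>>     double *local;
>>     MPI_Datatype darray;
>>     MPI_File fh;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* expects -np 4 */
>>
>>     /* Datatype describing this rank's block-cyclic pieces of the
>>        5x5 global array. */
>>     MPI_Type_create_darray(nprocs, rank, 2, gsizes, distribs, dargs,
>>                            psizes, MPI_ORDER_C, MPI_DOUBLE, &darray);
>>     MPI_Type_commit(&darray);
>>
>>     MPI_Type_size(darray, &locsize);
>>     local = malloc(locsize);
>>
>>     /* Read the previously written 25 doubles back through the
>>        darray file view. */
>>     MPI_File_open(MPI_COMM_WORLD, "darr_test.dat", MPI_MODE_RDONLY,
>>                   MPI_INFO_NULL, &fh);
>>     MPI_File_set_view(fh, 0, MPI_DOUBLE, darray, "native",
>>                       MPI_INFO_NULL);
>>     MPI_File_read_all(fh, local, locsize / (int)sizeof(double),
>>                       MPI_DOUBLE, MPI_STATUS_IGNORE);
>>     MPI_File_close(&fh);
>>
>>     printf("rank %d read %d doubles\n", rank,
>>            locsize / (int)sizeof(double));
>>
>>     free(local);
>>     MPI_Type_free(&darray);
>>     MPI_Finalize();
>>     return 0;
>> }
>>
>> Run with "mpirun -np 4 ./darr_sketch" after the 25 doubles have been
>> written to the file.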
>>
>> Running this with OMPI, the two match on all ranks.
>>
>> > mpirun -mca io ompio -np 4 ./darr_read.x
>> === Rank 0 === (9 elements)
>> Packed:  0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
>> Read:    0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
>>
>> === Rank 1 === (6 elements)
>> Packed: 15.0 16.0 17.0 20.0 21.0 22.0
>> Read:   15.0 16.0 17.0 20.0 21.0 22.0
>>
>> === Rank 2 === (6 elements)
>> Packed:  3.0  4.0  8.0  9.0 13.0 14.0
>> Read:    3.0  4.0  8.0  9.0 13.0 14.0
>>
>> === Rank 3 === (4 elements)
>> Packed: 18.0 19.0 23.0 24.0
>> Read:   18.0 19.0 23.0 24.0
>>
>>
>>
>> However, using ROMIO the two differ on two of the ranks:
>>
>> > mpirun -mca io romio -np 4 ./darr_read.x
>> === Rank 0 === (9 elements)
>> Packed:  0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
>> Read:    0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
>>
>> === Rank 1 === (6 elements)
>> Packed: 15.0 16.0 17.0 20.0 21.0 22.0
>> Read:    0.0  1.0  2.0  0.0  1.0  2.0
>>
>> === Rank 2 === (6 elements)
>> Packed:  3.0  4.0  8.0  9.0 13.0 14.0
>> Read:    3.0  4.0  8.0  9.0 13.0 14.0
>>
>> === Rank 3 === (4 elements)
>> Packed: 18.0 19.0 23.0 24.0
>> Read:    0.0  1.0  0.0  1.0
>>
>>
>>
>> My interpretation is that the behaviour with OMPIO is correct.
>> Interestingly everything matches up using both ROMIO and OMPIO if I set
>> the
>> block shape to 2x2.
>>
>> This was run on OS X using 1.8.2a1r31632. I have also run this on Linux
>> with OpenMPI 1.7.4, and OMPIO is still correct, but using ROMIO I just get
>> segfaults.
>>
>> Thanks,
>> Richard
>> -------------- next part --------------
>> HTML attachment scrubbed and removed
>> -------------- next part --------------
>> A non-text attachment was scrubbed...
>> Name: darr_read.c
>> Type: text/x-csrc
>> Size: 2218 bytes
>> Desc: not available
>> URL: <
>> http://www.open-mpi.org/MailArchives/users/attachments/20140505/5a5ab0ba/attachment.bin
>> >
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Tue, 06 May 2014 13:24:35 +0200
>> From: Imran Ali <imra...@student.matnat.uio.no>
>> To: <us...@open-mpi.org>
>> Subject: [OMPI users] MPI File Open does not work
>> Message-ID: <d57bdf499c00360b737205b085c50...@ulrik.uio.no>
>> Content-Type: text/plain; charset="utf-8"
>>
>>
>>
>> I get the following error when I try to run the following python code:
>>
>> import mpi4py.MPI as MPI
>> comm = MPI.COMM_WORLD
>> MPI.File.Open(comm,"some.file")
>>
>> $ mpirun -np 1 python test_mpi.py
>> Traceback (most recent call last):
>>   File "test_mpi.py", line 3, in <module>
>>     MPI.File.Open(comm," h5ex_d_alloc.h5")
>>   File "File.pyx", line 67, in mpi4py.MPI.File.Open (src/mpi4py.MPI.c:89639)
>> mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>>
>>
>> My mpirun version is (Open MPI) 1.6.2. I installed Open MPI using the
>> dorsal script (https://github.com/FEniCS/dorsal) for Red Hat Enterprise
>> Linux 6 (the OS I am using, release 6.5). It configured the build as
>> follows:
>>
>> ./configure --enable-mpi-thread-multiple --enable-opal-multi-threads
>> --with-threads=posix --disable-mpi-profile
>>
>> I should emphasize that I do not have root access on the system where
>> I am running my application.
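>>
>> As an aside, mpi4py's File.Open defaults to a read-only access mode,
>> so a variant of the call that passes an explicit mode (the file name
>> here is only a placeholder) looks like this; whether the mode is
>> actually related to the error above I do not know:
>>
>> import mpi4py.MPI as MPI
>>
>> comm = MPI.COMM_WORLD
>> # create the file if it does not exist and open it for writing
>> fh = MPI.File.Open(comm, "some.file",
>>                    MPI.MODE_CREATE | MPI.MODE_WRONLY)
>> fh.Close()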
>>
>> Imran
>>
>>
>>
>> -------------- next part --------------
>> HTML attachment scrubbed and removed
>>
>> ------------------------------
>>
>> Message: 4
>> Date: Tue, 6 May 2014 12:56:04 +0000
>> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
>> To: Open MPI Users <us...@open-mpi.org>
>> Subject: Re: [OMPI users] MPI File Open does not work
>> Message-ID: <e7df28cb-d4fb-4087-928e-18e61d1d2...@cisco.com>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> The thread support in the 1.6 series is not very good.  You might try:
>>
>> - Upgrading to 1.6.5
>> - Or better yet, upgrading to 1.8.1
>>
>>
>> On May 6, 2014, at 7:24 AM, Imran Ali <imra...@student.matnat.uio.no>
>> wrote:
>>
>> > I get the following error when I try to run the following python code
>> >
>> > import mpi4py.MPI as MPI
>> > comm =  MPI.COMM_WORLD
>> > MPI.File.Open(comm,"some.file")
>> >
>> > $ mpirun -np 1 python test_mpi.py
>> > Traceback (most recent call last):
>> >   File "test_mpi.py", line 3, in <module>
>> >     MPI.File.Open(comm," h5ex_d_alloc.h5")
>> >   File "File.pyx", line 67, in mpi4py.MPI.File.Open
>> (src/mpi4py.MPI.c:89639)
>> > mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
>> >
>> --------------------------------------------------------------------------
>> > mpirun noticed that the job aborted, but has no info as to the process
>> > that caused that situation.
>> >
>> --------------------------------------------------------------------------
>> >
>> > My mpirun version is (Open MPI) 1.6.2. I installed openmpi using the
>> dorsal script (https://github.com/FEniCS/dorsal) for Redhat Enterprise 6
>> (OS I am using, release 6.5) . It configured the build as following :
>> >
>> > ./configure --enable-mpi-thread-multiple --enable-opal-multi-threads
>> --with-threads=posix --disable-mpi-profile
>> >
>> > I need emphasize that I do not have root acces on the system I am
>> running my application.
>> >
>> > Imran
>> >
>> >
>> > _______________________________________________
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>>
>> ------------------------------
>>
>> Message: 5
>> Date: Tue, 6 May 2014 15:32:21 +0200
>> From: Imran Ali <imra...@student.matnat.uio.no>
>> To: Open MPI Users <us...@open-mpi.org>
>> Subject: Re: [OMPI users] MPI File Open does not work
>> Message-ID: <fa6dffff-6c66-4a47-84fc-148fb51ce...@math.uio.no>
>> Content-Type: text/plain; charset=us-ascii
>>
>>
>> 6. mai 2014 kl. 14:56 skrev Jeff Squyres (jsquyres) <jsquy...@cisco.com>:
>>
>> > The thread support in the 1.6 series is not very good.  You might try:
>> >
>> > - Upgrading to 1.6.5
>> > - Or better yet, upgrading to 1.8.1
>> >
>>
>> I will attempt that then. I read at
>>
>> http://www.open-mpi.org/faq/?category=building#install-overwrite
>>
>> that I should completely uninstall my previous version. Could you
>> recommend how I can go about doing it (without root access)?
>> I am uncertain where I can run make uninstall.
>>
>> Imran
>>
>> >
>> > On May 6, 2014, at 7:24 AM, Imran Ali <imra...@student.matnat.uio.no>
>> wrote:
>> >
>> >> I get the following error when I try to run the following python code
>> >>
>> >> import mpi4py.MPI as MPI
>> >> comm =  MPI.COMM_WORLD
>> >> MPI.File.Open(comm,"some.file")
>> >>
>> >> $ mpirun -np 1 python test_mpi.py
>> >> Traceback (most recent call last):
>> >>  File "test_mpi.py", line 3, in <module>
>> >>    MPI.File.Open(comm," h5ex_d_alloc.h5")
>> >>  File "File.pyx", line 67, in mpi4py.MPI.File.Open
>> (src/mpi4py.MPI.c:89639)
>> >> mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
>> >>
>> --------------------------------------------------------------------------
>> >> mpirun noticed that the job aborted, but has no info as to the process
>> >> that caused that situation.
>> >>
>> --------------------------------------------------------------------------
>> >>
>> >> My mpirun version is (Open MPI) 1.6.2. I installed openmpi using the
>> dorsal script (https://github.com/FEniCS/dorsal) for Redhat Enterprise 6
>> (OS I am using, release 6.5) . It configured the build as following :
>> >>
>> >> ./configure --enable-mpi-thread-multiple --enable-opal-multi-threads
>> --with-threads=posix --disable-mpi-profile
>> >>
>> >> I need emphasize that I do not have root acces on the system I am
>> running my application.
>> >>
>> >> Imran
>> >>
>> >>
>> >> _______________________________________________
>> >> users mailing list
>> >> us...@open-mpi.org
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>> >
>> > --
>> > Jeff Squyres
>> > jsquy...@cisco.com
>> > For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> >
>> > _______________________________________________
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>> ------------------------------
>>
>> Message: 6
>> Date: Tue, 6 May 2014 13:34:38 +0000
>> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
>> To: Open MPI Users <us...@open-mpi.org>
>> Subject: Re: [OMPI users] MPI File Open does not work
>> Message-ID: <2a933c0e-80f6-4ded-b44c-53b5f37ef...@cisco.com>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> On May 6, 2014, at 9:32 AM, Imran Ali <imra...@student.matnat.uio.no>
>> wrote:
>>
>> > I will attempt that than. I read at
>> >
>> > http://www.open-mpi.org/faq/?category=building#install-overwrite
>> >
>> > that I should completely uninstall my previous version.
>>
>> Yes, that is best.  OR: you can install into a whole separate tree and
>> ignore the first installation.
>>
>> > Could you recommend to me how I can go about doing it (without root
>> access).
>> > I am uncertain where I can use make uninstall.
>>
>> If you don't have write access into the installation tree (i.e., it was
>> installed via root and you don't have root access), then your best bet is
>> simply to install into a new tree.  E.g., if OMPI is installed into
>> /opt/openmpi-1.6.2, try installing into /opt/openmpi-1.6.5, or even
>> $HOME/installs/openmpi-1.6.5, or something like that.
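>>
>> For example, a sketch of that kind of user-level install (the version
>> number and configure options here are only placeholders):
>>
>>   ./configure --prefix=$HOME/installs/openmpi-1.8.1 \
>>       --enable-mpi-thread-multiple
>>   make all install
>>
>> Then put $HOME/installs/openmpi-1.8.1/bin at the front of your PATH
>> and $HOME/installs/openmpi-1.8.1/lib on LD_LIBRARY_PATH so the new
>> mpirun and libraries are picked up instead of the old installation.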
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>>
>> ------------------------------
>>
>> Message: 7
>> Date: Tue, 6 May 2014 15:40:34 +0200
>> From: Imran Ali <imra...@student.matnat.uio.no>
>> To: Open MPI Users <us...@open-mpi.org>
>> Subject: Re: [OMPI users] MPI File Open does not work
>> Message-ID: <14f0596c-c5c5-4b1a-a4a8-8849d44ab...@math.uio.no>
>> Content-Type: text/plain; charset=us-ascii
>>
>>
>> 6. mai 2014 kl. 15:34 skrev Jeff Squyres (jsquyres) <jsquy...@cisco.com>:
>>
>> > On May 6, 2014, at 9:32 AM, Imran Ali <imra...@student.matnat.uio.no>
>> wrote:
>> >
>> >> I will attempt that than. I read at
>> >>
>> >> http://www.open-mpi.org/faq/?category=building#install-overwrite
>> >>
>> >> that I should completely uninstall my previous version.
>> >
>> > Yes, that is best.  OR: you can install into a whole separate tree and
>> ignore the first installation.
>> >
>> >> Could you recommend to me how I can go about doing it (without root
>> access).
>> >> I am uncertain where I can use make uninstall.
>> >
>> > If you don't have write access into the installation tree (i.e., it was
>> installed via root and you don't have root access), then your best bet is
>> simply to install into a new tree.  E.g., if OMPI is installed into
>> /opt/openmpi-1.6.2, try installing into /opt/openmpi-1.6.5, or even
>> $HOME/installs/openmpi-1.6.5, or something like that.
>>
>> My install was in my user directory (i.e., $HOME). I managed to locate the
>> source directory and successfully run make uninstall.
>>
>> Will let you know how things went after installation.
>>
>> Imran
>>
>> >
>> > --
>> > Jeff Squyres
>> > jsquy...@cisco.com
>> > For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> >
>> > _______________________________________________
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>> ------------------------------
>>
>> Message: 8
>> Date: Tue, 6 May 2014 14:42:52 +0000
>> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
>> To: Open MPI Users <us...@open-mpi.org>
>> Subject: Re: [OMPI users] MPI File Open does not work
>> Message-ID: <710e3328-edaa-4a13-9f07-b45fe3191...@cisco.com>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> On May 6, 2014, at 9:40 AM, Imran Ali <imra...@student.matnat.uio.no>
>> wrote:
>>
>> > My install was in my user directory (i.e $HOME). I managed to locate
>> the source directory and successfully run make uninstall.
>>
>>
>> FWIW, I usually install Open MPI into its own subdir.  E.g.,
>> $HOME/installs/openmpi-x.y.z.  Then if I don't want that install any more,
>> I can just "rm -rf $HOME/installs/openmpi-x.y.z" -- no need to "make
>> uninstall".  Specifically: if there's nothing else installed in the same
>> tree as Open MPI, you can just rm -rf its installation tree.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>>
>> ------------------------------
>>
>> Message: 9
>> Date: Tue, 6 May 2014 14:50:34 +0000
>> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
>> To: Open MPI Users <us...@open-mpi.org>
>> Subject: Re: [OMPI users] users Digest, Vol 2879, Issue 1
>> Message-ID: <c60aa7e1-96a7-4ccd-9b5b-11a38fb87...@cisco.com>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> Are you using TCP as the MPI transport?
>>
>> If so, another thing to try is to limit the IP interfaces that MPI uses
>> for its traffic to see if there's some kind of problem with specific
>> networks.
>>
>> For example:
>>
>>    mpirun --mca btl_tcp_if_include eth0 ...
>>
>> If that works, then try adding in any/all other IP interfaces that you
>> have on your machines.
>>
>> A sorta-wild guess: you have some IP interfaces that aren't working, or
>> at least, don't work in the way that OMPI wants them to work.  So the first
>> barrier works because it flows across eth0 (or some other first network
>> that, as far as OMPI is concerned, works just fine).  But then the next
>> barrier round-robin advances to the next IP interface, and it doesn't work
>> for some reason.
>>
>> We've seen virtual machine bridge interfaces cause problems, for example.
>>  E.g., if a machine has a Xen virtual machine interface (virbr0, IIRC?),
>> then OMPI will try to use it to communicate with peer MPI processes because
>> it has a "compatible" IP address, and OMPI thinks it should be
>> connected/reachable to peers.  If this is the case, you might want to
>> disable such interfaces and/or use btl_tcp_if_include or btl_tcp_if_exclude
>> to select the interfaces that you want to use.
>>
>> Pro tip: if you use btl_tcp_if_exclude, remember to exclude the loopback
>> interface, too.  OMPI defaults to a btl_tcp_if_include="" (blank) and
>> btl_tcp_if_exclude="lo0". So if you override btl_tcp_if_exclude, you need
>> to be sure to *also* include lo0 in the new value.  For example:
>>
>>    mpirun --mca btl_tcp_if_exclude lo0,virbr0 ...
>>
>> Also, if possible, try upgrading to Open MPI 1.8.1.
>>
>>
>>
>> On May 4, 2014, at 2:15 PM, Clay Kirkland <clay.kirkl...@versityinc.com>
>> wrote:
>>
>> >  I am configuring with all defaults.   Just doing a ./configure and then
>> > make and make install.   I have used open mpi on several kinds of
>> > unix  systems this way and have had no trouble before.   I believe I
>> > last had success on a redhat version of linux.
>> >
>> >
>> > On Sat, May 3, 2014 at 11:00 AM, <users-requ...@open-mpi.org> wrote:
>> > Send users mailing list submissions to
>> >         us...@open-mpi.org
>> >
>> > To subscribe or unsubscribe via the World Wide Web, visit
>> >         http://www.open-mpi.org/mailman/listinfo.cgi/users
>> > or, via email, send a message with subject or body 'help' to
>> >         users-requ...@open-mpi.org
>> >
>> > You can reach the person managing the list at
>> >         users-ow...@open-mpi.org
>> >
>> > When replying, please edit your Subject line so it is more specific
>> > than "Re: Contents of users digest..."
>> >
>> >
>> > Today's Topics:
>> >
>> >    1. MPI_Barrier hangs on second attempt but only when multiple
>> >       hosts used. (Clay Kirkland)
>> >    2. Re: MPI_Barrier hangs on second attempt but only when
>> >       multiple hosts used. (Ralph Castain)
>> >
>> >
>> > ----------------------------------------------------------------------
>> >
>> > Message: 1
>> > Date: Fri, 2 May 2014 16:24:04 -0500
>> > From: Clay Kirkland <clay.kirkl...@versityinc.com>
>> > To: us...@open-mpi.org
>> > Subject: [OMPI users] MPI_Barrier hangs on second attempt but only
>> >         when    multiple hosts used.
>> > Message-ID:
>> >         <CAJDnjA8Wi=FEjz6Vz+Bc34b+nFE=
>> tf4b7g0bqgmbekg7h-p...@mail.gmail.com>
>> > Content-Type: text/plain; charset="utf-8"
>> >
>> > I have been using MPI for many many years so I have very well debugged
>> mpi
>> > tests.   I am
>> > having trouble on either openmpi-1.4.5  or  openmpi-1.6.5 versions
>> though
>> > with getting the
>> > MPI_Barrier calls to work.   It works fine when I run all processes on
>> one
>> > machine but when
>> > I run with two or more hosts the second call to MPI_Barrier always
>> hangs.
>> > Not the first one,
>> > but always the second one.   I looked at FAQ's and such but found
>> nothing
>> > except for a comment
>> > that MPI_Barrier problems were often problems with fire walls.  Also
>> > mentioned as a problem
>> > was not having the same version of mpi on both machines.  I turned
>> > firewalls off and removed
>> > and reinstalled the same version on both hosts but I still see the same
>> > thing.   I then installed
>> > lam mpi on two of my machines and that works fine.   I can call the
>> > MPI_Barrier function when run on
>> > one of two machines by itself  many times with no hangs.  Only hangs if
>> two
>> > or more hosts are involved.
>> > These runs are all being done on CentOS release 6.4.   Here is test
>> program
>> > I used.
>> >
>> > main (argc, argv)
>> > int argc;
>> > char **argv;
>> > {
>> >     char message[20];
>> >     char hoster[256];
>> >     char nameis[256];
>> >     int fd, i, j, jnp, iret, myrank, np, ranker, recker;
>> >     MPI_Comm comm;
>> >     MPI_Status status;
>> >
>> >     MPI_Init( &argc, &argv );
>> >     MPI_Comm_rank( MPI_COMM_WORLD, &myrank);
>> >     MPI_Comm_size( MPI_COMM_WORLD, &np);
>> >
>> >         gethostname(hoster,256);
>> >
>> >         printf(" In rank %d and host= %s  Do Barrier call
>> > 1.\n",myrank,hoster);
>> >     MPI_Barrier(MPI_COMM_WORLD);
>> >         printf(" In rank %d and host= %s  Do Barrier call
>> > 2.\n",myrank,hoster);
>> >     MPI_Barrier(MPI_COMM_WORLD);
>> >         printf(" In rank %d and host= %s  Do Barrier call
>> > 3.\n",myrank,hoster);
>> >     MPI_Barrier(MPI_COMM_WORLD);
>> >     MPI_Finalize();
>> >     exit(0);
>> > }
>> >
>> >   Here are three runs of test program.  First with two processes on one
>> > host, then with
>> > two processes on another host, and finally with one process on each of
>> two
>> > hosts.  The
>> > first two runs are fine but the last run hangs on the second
>> MPI_Barrier.
>> >
>> > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos a.out
>> >  In rank 0 and host= centos  Do Barrier call 1.
>> >  In rank 1 and host= centos  Do Barrier call 1.
>> >  In rank 1 and host= centos  Do Barrier call 2.
>> >  In rank 1 and host= centos  Do Barrier call 3.
>> >  In rank 0 and host= centos  Do Barrier call 2.
>> >  In rank 0 and host= centos  Do Barrier call 3.
>> > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out
>> > /root/.bashrc: line 14: unalias: ls: not found
>> >  In rank 0 and host= RAID  Do Barrier call 1.
>> >  In rank 0 and host= RAID  Do Barrier call 2.
>> >  In rank 0 and host= RAID  Do Barrier call 3.
>> >  In rank 1 and host= RAID  Do Barrier call 1.
>> >  In rank 1 and host= RAID  Do Barrier call 2.
>> >  In rank 1 and host= RAID  Do Barrier call 3.
>> > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos,RAID a.out
>> > /root/.bashrc: line 14: unalias: ls: not found
>> >  In rank 0 and host= centos  Do Barrier call 1.
>> >  In rank 0 and host= centos  Do Barrier call 2.
>> > In rank 1 and host= RAID  Do Barrier call 1.
>> >  In rank 1 and host= RAID  Do Barrier call 2.
>> >
>> >   Since it is such a simple test and problem and such a widely used MPI
>> > function, it must obviously
>> > be an installation or configuration problem.   A pstack for each of the
>> > hung MPI_Barrier processes
>> > on the two machines shows this:
>> >
>> > [root@centos ~]# pstack 31666
>> > #0  0x0000003baf0e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
>> > #1  0x00007f5de06125eb in epoll_dispatch () from
>> /usr/local/lib/libmpi.so.1
>> > #2  0x00007f5de061475a in opal_event_base_loop () from
>> > /usr/local/lib/libmpi.so.1
>> > #3  0x00007f5de0639229 in opal_progress () from
>> /usr/local/lib/libmpi.so.1
>> > #4  0x00007f5de0586f75 in ompi_request_default_wait_all () from
>> > /usr/local/lib/libmpi.so.1
>> > #5  0x00007f5ddc59565e in ompi_coll_tuned_sendrecv_actual () from
>> > /usr/local/lib/openmpi/mca_coll_tuned.so
>> > #6  0x00007f5ddc59d8ff in ompi_coll_tuned_barrier_intra_two_procs ()
>> from
>> > /usr/local/lib/openmpi/mca_coll_tuned.so
>> > #7  0x00007f5de05941c2 in PMPI_Barrier () from
>> /usr/local/lib/libmpi.so.1
>> > #8  0x0000000000400a43 in main ()
>> >
>> > [root@RAID openmpi-1.6.5]# pstack 22167
>> > #0  0x00000030302e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
>> > #1  0x00007f7ee46885eb in epoll_dispatch () from
>> /usr/local/lib/libmpi.so.1
>> > #2  0x00007f7ee468a75a in opal_event_base_loop () from
>> > /usr/local/lib/libmpi.so.1
>> > #3  0x00007f7ee46af229 in opal_progress () from
>> /usr/local/lib/libmpi.so.1
>> > #4  0x00007f7ee45fcf75 in ompi_request_default_wait_all () from
>> > /usr/local/lib/libmpi.so.1
>> > #5  0x00007f7ee060b65e in ompi_coll_tuned_sendrecv_actual () from
>> > /usr/local/lib/openmpi/mca_coll_tuned.so
>> > #6  0x00007f7ee06138ff in ompi_coll_tuned_barrier_intra_two_procs ()
>> from
>> > /usr/local/lib/openmpi/mca_coll_tuned.so
>> > #7  0x00007f7ee460a1c2 in PMPI_Barrier () from
>> /usr/local/lib/libmpi.so.1
>> > #8  0x0000000000400a43 in main ()
>> >
>> >  Which looks exactly the same on each machine.  Any thoughts or ideas
>> would
>> > be greatly appreciated as
>> > I am stuck.
>> >
>> >  Clay Kirkland
>> > -------------- next part --------------
>> > HTML attachment scrubbed and removed
>> >
>> > ------------------------------
>> >
>> > Message: 2
>> > Date: Sat, 3 May 2014 06:39:20 -0700
>> > From: Ralph Castain <r...@open-mpi.org>
>> > To: Open MPI Users <us...@open-mpi.org>
>> > Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but only
>> >         when    multiple hosts used.
>> > Message-ID: <3cf53d73-15d9-40bb-a2de-50ba3561a...@open-mpi.org>
>> > Content-Type: text/plain; charset="us-ascii"
>> >
>> > Hmmm...just testing on my little cluster here on two nodes, it works
>> just fine with 1.8.2:
>> >
>> > [rhc@bend001 v1.8]$ mpirun -n 2 --map-by node ./a.out
>> >  In rank 0 and host= bend001  Do Barrier call 1.
>> >  In rank 0 and host= bend001  Do Barrier call 2.
>> >  In rank 0 and host= bend001  Do Barrier call 3.
>> >  In rank 1 and host= bend002  Do Barrier call 1.
>> >  In rank 1 and host= bend002  Do Barrier call 2.
>> >  In rank 1 and host= bend002  Do Barrier call 3.
>> > [rhc@bend001 v1.8]$
>> >
>> >
>> > How are you configuring OMPI?
>> >
>> >
>> > On May 2, 2014, at 2:24 PM, Clay Kirkland <clay.kirkl...@versityinc.com>
>> wrote:
>> >
>> > > I have been using MPI for many many years so I have very well
>> debugged mpi tests.   I am
>> > > having trouble on either openmpi-1.4.5  or  openmpi-1.6.5 versions
>> though with getting the
>> > > MPI_Barrier calls to work.   It works fine when I run all processes
>> on one machine but when
>> > > I run with two or more hosts the second call to MPI_Barrier always
>> hangs.   Not the first one,
>> > > but always the second one.   I looked at FAQ's and such but found
>> nothing except for a comment
>> > > that MPI_Barrier problems were often problems with fire walls.  Also
>> mentioned as a problem
>> > > was not having the same version of mpi on both machines.  I turned
>> firewalls off and removed
>> > > and reinstalled the same version on both hosts but I still see the
>> same thing.   I then installed
>> > > lam mpi on two of my machines and that works fine.   I can call the
>> MPI_Barrier function when run on
>> > > one of two machines by itself  many times with no hangs.  Only hangs
>> if two or more hosts are involved.
>> > > These runs are all being done on CentOS release 6.4.   Here is test
>> program I used.
>> > >
>> > > main (argc, argv)
>> > > int argc;
>> > > char **argv;
>> > > {
>> > >     char message[20];
>> > >     char hoster[256];
>> > >     char nameis[256];
>> > >     int fd, i, j, jnp, iret, myrank, np, ranker, recker;
>> > >     MPI_Comm comm;
>> > >     MPI_Status status;
>> > >
>> > >     MPI_Init( &argc, &argv );
>> > >     MPI_Comm_rank( MPI_COMM_WORLD, &myrank);
>> > >     MPI_Comm_size( MPI_COMM_WORLD, &np);
>> > >
>> > >         gethostname(hoster,256);
>> > >
>> > >         printf(" In rank %d and host= %s  Do Barrier call
>> 1.\n",myrank,hoster);
>> > >     MPI_Barrier(MPI_COMM_WORLD);
>> > >         printf(" In rank %d and host= %s  Do Barrier call
>> 2.\n",myrank,hoster);
>> > >     MPI_Barrier(MPI_COMM_WORLD);
>> > >         printf(" In rank %d and host= %s  Do Barrier call
>> 3.\n",myrank,hoster);
>> > >     MPI_Barrier(MPI_COMM_WORLD);
>> > >     MPI_Finalize();
>> > >     exit(0);
>> > > }
>> > >
>> > >   Here are three runs of test program.  First with two processes on
>> one host, then with
>> > > two processes on another host, and finally with one process on each
>> of two hosts.  The
>> > > first two runs are fine but the last run hangs on the second
>> MPI_Barrier.
>> > >
>> > > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos a.out
>> > >  In rank 0 and host= centos  Do Barrier call 1.
>> > >  In rank 1 and host= centos  Do Barrier call 1.
>> > >  In rank 1 and host= centos  Do Barrier call 2.
>> > >  In rank 1 and host= centos  Do Barrier call 3.
>> > >  In rank 0 and host= centos  Do Barrier call 2.
>> > >  In rank 0 and host= centos  Do Barrier call 3.
>> > > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out
>> > > /root/.bashrc: line 14: unalias: ls: not found
>> > >  In rank 0 and host= RAID  Do Barrier call 1.
>> > >  In rank 0 and host= RAID  Do Barrier call 2.
>> > >  In rank 0 and host= RAID  Do Barrier call 3.
>> > >  In rank 1 and host= RAID  Do Barrier call 1.
>> > >  In rank 1 and host= RAID  Do Barrier call 2.
>> > >  In rank 1 and host= RAID  Do Barrier call 3.
>> > > [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos,RAID
>> a.out
>> > > /root/.bashrc: line 14: unalias: ls: not found
>> > >  In rank 0 and host= centos  Do Barrier call 1.
>> > >  In rank 0 and host= centos  Do Barrier call 2.
>> > > In rank 1 and host= RAID  Do Barrier call 1.
>> > >  In rank 1 and host= RAID  Do Barrier call 2.
>> > >
>> > >   Since it is such a simple test and problem and such a widely used
>> MPI function, it must obviously
>> > > be an installation or configuration problem.   A pstack for each of
>> the hung MPI_Barrier processes
>> > > on the two machines shows this:
>> > >
>> > > [root@centos ~]# pstack 31666
>> > > #0  0x0000003baf0e8ee3 in __epoll_wait_nocancel () from
>> /lib64/libc.so.6
>> > > #1  0x00007f5de06125eb in epoll_dispatch () from
>> /usr/local/lib/libmpi.so.1
>> > > #2  0x00007f5de061475a in opal_event_base_loop () from
>> /usr/local/lib/libmpi.so.1
>> > > #3  0x00007f5de0639229 in opal_progress () from
>> /usr/local/lib/libmpi.so.1
>> > > #4  0x00007f5de0586f75 in ompi_request_default_wait_all () from
>> /usr/local/lib/libmpi.so.1
>> > > #5  0x00007f5ddc59565e in ompi_coll_tuned_sendrecv_actual () from
>> /usr/local/lib/openmpi/mca_coll_tuned.so
>> > > #6  0x00007f5ddc59d8ff in ompi_coll_tuned_barrier_intra_two_procs ()
>> from /usr/local/lib/openmpi/mca_coll_tuned.so
>> > > #7  0x00007f5de05941c2 in PMPI_Barrier () from
>> /usr/local/lib/libmpi.so.1
>> > > #8  0x0000000000400a43 in main ()
>> > >
>> > > [root@RAID openmpi-1.6.5]# pstack 22167
>> > > #0  0x00000030302e8ee3 in __epoll_wait_nocancel () from
>> /lib64/libc.so.6
>> > > #1  0x00007f7ee46885eb in epoll_dispatch () from
>> /usr/local/lib/libmpi.so.1
>> > > #2  0x00007f7ee468a75a in opal_event_base_loop () from
>> /usr/local/lib/libmpi.so.1
>> > > #3  0x00007f7ee46af229 in opal_progress () from
>> /usr/local/lib/libmpi.so.1
>> > > #4  0x00007f7ee45fcf75 in ompi_request_default_wait_all () from
>> /usr/local/lib/libmpi.so.1
>> > > #5  0x00007f7ee060b65e in ompi_coll_tuned_sendrecv_actual () from
>> /usr/local/lib/openmpi/mca_coll_tuned.so
>> > > #6  0x00007f7ee06138ff in ompi_coll_tuned_barrier_intra_two_procs ()
>> from /usr/local/lib/openmpi/mca_coll_tuned.so
>> > > #7  0x00007f7ee460a1c2 in PMPI_Barrier () from
>> /usr/local/lib/libmpi.so.1
>> > > #8  0x0000000000400a43 in main ()
>> > >
>> > >  Which looks exactly the same on each machine.  Any thoughts or ideas
>> would be greatly appreciated as
>> > > I am stuck.
>> > >
>> > >  Clay Kirkland
>> > >
>> > >
>> > >
>> > >
>> > > _______________________________________________
>> > > users mailing list
>> > > us...@open-mpi.org
>> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>> > -------------- next part --------------
>> > HTML attachment scrubbed and removed
>> >
>> > ------------------------------
>> >
>> > Subject: Digest Footer
>> >
>> > _______________________________________________
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>> > ------------------------------
>> >
>> > End of users Digest, Vol 2879, Issue 1
>> > **************************************
>> >
>> > _______________________________________________
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>>
>> ------------------------------
>>
>> Subject: Digest Footer
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> ------------------------------
>>
>> End of users Digest, Vol 2881, Issue 1
>> **************************************
>>
>
>
