From: Clay Kirkland [mailto:clay.kirkl...@versityinc.com]
Sent: Friday, May 02, 2014 03:24 PM
To: us...@open-mpi.org <us...@open-mpi.org>
Subject: [OMPI users] MPI_Barrier hangs on second attempt but only when multiple hosts used.

I have been using MPI for many, many years, so I have very well debugged MPI
tests.  With either openmpi-1.4.5 or openmpi-1.6.5, though, I am having trouble
getting MPI_Barrier calls to work.  Everything is fine when I run all processes
on one machine, but when I run with two or more hosts the second call to
MPI_Barrier always hangs.  Not the first one, but always the second one.  I
looked at the FAQs and such but found nothing except a comment that MPI_Barrier
problems are often firewall problems, and another that not having the same MPI
version on both machines can cause trouble.  I turned the firewalls off and
removed and reinstalled the same version on both hosts, but I still see the same
thing.  I then installed LAM/MPI on two of my machines and that works fine.  I
can call MPI_Barrier many times with no hangs when running on either of the two
machines by itself; it only hangs when two or more hosts are involved.
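
  For the record, "turned the firewalls off" means roughly the following on both
boxes (CentOS 6 commands from memory, so treat them as approximate), and I
checked the installed version the same way on each host:

    service iptables stop
    chkconfig iptables off
    /usr/local/bin/ompi_info | grep "Open MPI:"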
These runs are all being done on CentOS release 6.4.  Here is the test program I
used.

/* Minimal barrier test: each rank prints, then calls MPI_Barrier, three times. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char hoster[256];
    int myrank, np;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    gethostname(hoster, sizeof(hoster));

    printf(" In rank %d and host= %s  Do Barrier call 1.\n", myrank, hoster);
    MPI_Barrier(MPI_COMM_WORLD);
    printf(" In rank %d and host= %s  Do Barrier call 2.\n", myrank, hoster);
    MPI_Barrier(MPI_COMM_WORLD);
    printf(" In rank %d and host= %s  Do Barrier call 3.\n", myrank, hoster);
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
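
  For completeness, the program is built with plain mpicc and no special flags
(the source file name here is just whatever I happened to call it; it produces
the a.out used below):

[root@centos MPI]# /usr/local/bin/mpicc barrier.c -o a.out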

  Here are three runs of the test program: first with two processes on one host,
then with two processes on the other host, and finally with one process on each
of the two hosts.  The first two runs are fine, but the last run hangs on the
second MPI_Barrier.

[root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos a.out
 In rank 0 and host= centos  Do Barrier call 1.
 In rank 1 and host= centos  Do Barrier call 1.
 In rank 1 and host= centos  Do Barrier call 2.
 In rank 1 and host= centos  Do Barrier call 3.
 In rank 0 and host= centos  Do Barrier call 2.
 In rank 0 and host= centos  Do Barrier call 3.
[root@centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out
/root/.bashrc: line 14: unalias: ls: not found
 In rank 0 and host= RAID  Do Barrier call 1.
 In rank 0 and host= RAID  Do Barrier call 2.
 In rank 0 and host= RAID  Do Barrier call 3.
 In rank 1 and host= RAID  Do Barrier call 1.
 In rank 1 and host= RAID  Do Barrier call 2.
 In rank 1 and host= RAID  Do Barrier call 3.
[root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos,RAID a.out
/root/.bashrc: line 14: unalias: ls: not found
 In rank 0 and host= centos  Do Barrier call 1.
 In rank 0 and host= centos  Do Barrier call 2.
In rank 1 and host= RAID  Do Barrier call 1.
 In rank 1 and host= RAID  Do Barrier call 2.

  Since this is such a simple test of such a widely used MPI function, it must
be an installation or configuration problem.  A pstack of each of the hung
MPI_Barrier processes on the two machines shows this:

[root@centos ~]# pstack 31666
#0  0x0000003baf0e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x00007f5de06125eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
#2  0x00007f5de061475a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
#3  0x00007f5de0639229 in opal_progress () from /usr/local/lib/libmpi.so.1
#4  0x00007f5de0586f75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
#5  0x00007f5ddc59565e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
#6  0x00007f5ddc59d8ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
#7  0x00007f5de05941c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
#8  0x0000000000400a43 in main ()

[root@RAID openmpi-1.6.5]# pstack 22167
#0  0x00000030302e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x00007f7ee46885eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
#2  0x00007f7ee468a75a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
#3  0x00007f7ee46af229 in opal_progress () from /usr/local/lib/libmpi.so.1
#4  0x00007f7ee45fcf75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
#5  0x00007f7ee060b65e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
#6  0x00007f7ee06138ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
#7  0x00007f7ee460a1c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
#8  0x0000000000400a43 in main ()

  The traces look exactly the same on both machines.  Any thoughts or ideas would
be greatly appreciated, as I am stuck.
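
  If it would help narrow things down, I can rerun while forcing a different
barrier algorithm in the tuned component, or with the TCP BTL pinned to a single
interface in case having more than one NIC on these hosts matters.  Something
along these lines, with the MCA parameter names taken from my reading of the
ompi_info output (so please correct me if I have them wrong; eth0 is just a
placeholder for whichever interface the two hosts actually share):

/usr/local/bin/mpirun -np 2 --host centos,RAID --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_barrier_algorithm 1 a.out
/usr/local/bin/mpirun -np 2 --host centos,RAID --mca btl_tcp_if_include eth0 a.out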

 Clay Kirkland
