Hello,

Is it safe to re-use the same buffer (variable A) for MPI_Send and MPI_Recv, given that MPI_Send may be eager depending on the MCA parameters?
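For concreteness, here is a minimal sketch of the pattern I have in mind (illustrative only: the ranks, message size N and tag are made up and this is not taken from our actual code). The same buffer A is used for the send and then again for the receive:

/* Minimal sketch of the pattern in question (illustrative only: the
 * ranks, message size N and tag are made up).  The same buffer A is
 * used first for MPI_Send and then for MPI_Recv. */
#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char *argv[])
{
    double A[N];
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < N; i++)
        A[i] = (double) rank;

    if (rank == 0) {
        /* Send from A; the message may go eagerly or via rendezvous
         * depending on the MCA parameters. */
        MPI_Send(A, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        /* ...then receive into the very same buffer A.  This re-use
         * is what my question is about. */
        MPI_Recv(A, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(A, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(A, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    printf("P%d: done, A[0] = %g\n", rank, A[0]);
    MPI_Finalize();
    return 0;
}

The question is whether the receive into A is safe once MPI_Send has returned, even if the message was sent eagerly.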
Sébastien
________________________________________
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Ole Nielsen [ole.moller.niel...@gmail.com]
> Sent: 19 September 2011 04:59
> To: us...@open-mpi.org
> Subject: [OMPI users] MPI hangs on multiple nodes
>
> The test program is available here:
> http://code.google.com/p/pypar/source/browse/source/mpi_test.c
>
> Hopefully, someone can help us troubleshoot why communication stops when
> multiple nodes are involved and CPU usage goes to 100% for as long as we
> leave the program running.
>
> Many thanks
> Ole Nielsen
>
>
> ---------- Forwarded message ----------
> From: Ole Nielsen <ole.moller.niel...@gmail.com>
> Date: Mon, Sep 19, 2011 at 3:39 PM
> Subject: Re: MPI hangs on multiple nodes
> To: us...@open-mpi.org
>
>
> Further to the posting below, I can report that the test program (attached -
> this time correctly) is chewing up CPU time on both compute nodes for as long
> as I care to let it continue. It would appear that the processes are stuck in
> the MPI receive call, which is the next command after the print statements in
> the test program.
>
> Has anyone else seen this behavior, or can anyone give me a hint on how to
> troubleshoot it?
>
> Cheers and thanks
> Ole Nielsen
>
> Output:
>
> nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17,node18 --npernode 2 a.out
> Number of processes = 4
> Test repeated 3 times for reliability
> I am process 2 on node node18
> P2: Waiting to receive from to P1
> I am process 0 on node node17
> Run 1 of 3
> P0: Sending to P1
> I am process 1 on node node17
> P1: Waiting to receive from to P0
> I am process 3 on node node18
> P3: Waiting to receive from to P2
> P0: Waiting to receive from P3
>
> P1: Sending to to P2
> P1: Waiting to receive from to P0
> P2: Sending to to P3
>
> P0: Received from to P3
> Run 2 of 3
> P0: Sending to P1
> P3: Sending to to P0
>
> P3: Waiting to receive from to P2
> P2: Waiting to receive from to P1
> P1: Sending to to P2
> P0: Waiting to receive from P3
>
>
> On Mon, Sep 19, 2011 at 11:04 AM, Ole Nielsen <ole.moller.niel...@gmail.com> wrote:
>
> Hi all
>
> We have been using OpenMPI for many years with Ubuntu on our 20-node cluster.
> Each node has two quad-core CPUs, so we usually run up to 8 processes on each
> node, up to a maximum of 160 processes.
>
> However, we just upgraded the cluster to Ubuntu 11.04 with Open MPI 1.4.3 and
> have come across a strange behavior where MPI programs run perfectly well when
> confined to one node but hang during communication across multiple nodes. We
> have no idea why and would like some help in debugging this. A small MPI test
> program is attached and typical output is shown below.
> Hope someone can help us
> Cheers and thanks
> Ole Nielsen
>
> -------------------- Test output across two nodes (This one hangs) --------------------------
> nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17,node18 --npernode 2 a.out
> Number of processes = 4
> Test repeated 3 times for reliability
> I am process 1 on node node17
> P1: Waiting to receive from to P0
> I am process 0 on node node17
> Run 1 of 3
> P0: Sending to P1
> I am process 2 on node node18
> P2: Waiting to receive from to P1
> I am process 3 on node node18
> P3: Waiting to receive from to P2
> P1: Sending to to P2
>
>
> -------------------- Test output within one node (This one is OK) --------------------------
> nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17 --npernode 4 a.out
> Number of processes = 4
> Test repeated 3 times for reliability
> I am process 2 on node node17
> P2: Waiting to receive from to P1
> I am process 0 on node node17
> Run 1 of 3
> P0: Sending to P1
> I am process 1 on node node17
> P1: Waiting to receive from to P0
> I am process 3 on node node17
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Waiting to receive from to P0
> P2: Waiting to receive from to P1
> P3: Sending to to P0
> P0: Received from to P3
> Run 2 of 3
> P0: Sending to P1
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Waiting to receive from to P0
> P3: Sending to to P0
> P2: Waiting to receive from to P1
> P0: Received from to P3
> Run 3 of 3
> P0: Sending to P1
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Done
> P2: Done
> P3: Sending to to P0
> P0: Received from to P3
> P0: Done
> P3: Done
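For reference, the quoted output suggests a simple ring pass (P0 -> P1 -> P2 -> P3 -> P0, repeated three times). The actual test program is mpi_test.c at the URL quoted above; a rough sketch consistent with that output (the message size and data type here are guesses) would look something like this:

/* Rough reconstruction of the ring-pass pattern suggested by the quoted
 * output; the real test program is mpi_test.c at the URL above.  The
 * message size M and the data type are guesses. */
#include <mpi.h>
#include <stdio.h>

#define M 1000
#define REPEATS 3

int main(int argc, char *argv[])
{
    double buf[M];
    char node[MPI_MAX_PROCESSOR_NAME];
    int rank, nproc, len, run, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Get_processor_name(node, &len);

    if (rank == 0) {
        printf("Number of processes = %d\n", nproc);
        printf("Test repeated %d times for reliability\n", REPEATS);
    }
    printf("I am process %d on node %s\n", rank, node);

    for (i = 0; i < M; i++)
        buf[i] = (double) rank;

    if (nproc < 2) {          /* a ring needs at least two processes */
        MPI_Finalize();
        return 0;
    }

    for (run = 0; run < REPEATS; run++) {
        if (rank == 0) {
            /* P0 starts each round: send to P1, then wait for the last rank. */
            printf("P%d: Sending to P%d\n", rank, 1);
            MPI_Send(buf, M, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            printf("P%d: Waiting to receive from P%d\n", rank, nproc - 1);
            MPI_Recv(buf, M, MPI_DOUBLE, nproc - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("P%d: Received from P%d\n", rank, nproc - 1);
        } else {
            /* Everyone else: receive from the left neighbour, send to the right. */
            printf("P%d: Waiting to receive from P%d\n", rank, rank - 1);
            MPI_Recv(buf, M, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("P%d: Sending to P%d\n", rank, (rank + 1) % nproc);
            MPI_Send(buf, M, MPI_DOUBLE, (rank + 1) % nproc, 0, MPI_COMM_WORLD);
        }
    }

    printf("P%d: Done\n", rank);
    MPI_Finalize();
    return 0;
}

In a pattern like this every rank except P0 blocks in a receive until its left-hand neighbour has sent, so if a single inter-node message never completes the remaining ranks all sit waiting while the runtime polls, which would be consistent with the 100% CPU usage and the hang reported above.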