Hello,

Is it safe to re-use the same buffer (variable A) for MPI_Send and MPI_Recv, given that MPI_Send may be eager depending on the MCA parameters?
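For concreteness, here is a minimal sketch of the pattern I have in mind (illustrative only: the ranks, message size N and tag are made up and this is not taken from our actual code). The same buffer A is used for the send and then again for the receive:

/* Minimal sketch of the pattern in question (illustrative only: the
 * ranks, message size N and tag are made up).  The same buffer A is
 * used first for MPI_Send and then for MPI_Recv. */
#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char *argv[])
{
    double A[N];
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < N; i++)
        A[i] = (double) rank;

    if (rank == 0) {
        /* Send from A; the message may go eagerly or via rendezvous
         * depending on the MCA parameters. */
        MPI_Send(A, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        /* ...then receive into the very same buffer A.  This re-use
         * is what my question is about. */
        MPI_Recv(A, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(A, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(A, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    printf("P%d: done, A[0] = %g\n", rank, A[0]);
    MPI_Finalize();
    return 0;
}

The question is whether the receive into A is safe once MPI_Send has returned, even if the message was sent eagerly.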
Sébastien
________________________________________
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Ole Nielsen [ole.moller.niel...@gmail.com]
> Sent: 19 September 2011 04:59
> To: us...@open-mpi.org
> Subject: [OMPI users] MPI hangs on multiple nodes
>
> The test program is available here:
> http://code.google.com/p/pypar/source/browse/source/mpi_test.c
>
> Hopefully, someone can help us troubleshoot why communication stops when
> multiple nodes are involved and CPU usage goes to 100% for as long as we
> leave the program running.
>
> Many thanks
> Ole Nielsen
>
>
> ---------- Forwarded message ----------
> From: Ole Nielsen <ole.moller.niel...@gmail.com>
> Date: Mon, Sep 19, 2011 at 3:39 PM
> Subject: Re: MPI hangs on multiple nodes
> To: us...@open-mpi.org
>
>
> Further to the posting below, I can report that the test program (attached -
> this time correctly) is chewing up CPU time on both compute nodes for as long
> as I care to let it continue. It would appear that the processes are stuck in
> the MPI receive call, which is the next command after the print statements in
> the test program.
>
> Has anyone else seen this behavior, or can anyone give me a hint on how to
> troubleshoot it?
>
> Cheers and thanks
> Ole Nielsen
>
> Output:
>
> nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17,node18 --npernode 2 a.out
> Number of processes = 4
> Test repeated 3 times for reliability
> I am process 2 on node node18
> P2: Waiting to receive from to P1
> I am process 0 on node node17
> Run 1 of 3
> P0: Sending to P1
> I am process 1 on node node17
> P1: Waiting to receive from to P0
> I am process 3 on node node18
> P3: Waiting to receive from to P2
> P0: Waiting to receive from P3
>
> P1: Sending to to P2
> P1: Waiting to receive from to P0
> P2: Sending to to P3
>
> P0: Received from to P3
> Run 2 of 3
> P0: Sending to P1
> P3: Sending to to P0
>
> P3: Waiting to receive from to P2
> P2: Waiting to receive from to P1
> P1: Sending to to P2
> P0: Waiting to receive from P3
>
>
> On Mon, Sep 19, 2011 at 11:04 AM, Ole Nielsen <ole.moller.niel...@gmail.com> wrote:
>
> Hi all
>
> We have been using OpenMPI for many years with Ubuntu on our 20-node cluster.
> Each node has two quad-core CPUs, so we usually run up to 8 processes on each
> node, up to a maximum of 160 processes.
>
> However, we just upgraded the cluster to Ubuntu 11.04 with Open MPI 1.4.3 and
> have come across a strange behavior where MPI programs run perfectly well when
> confined to one node but hang during communication across multiple nodes. We
> have no idea why and would like some help in debugging this. A small MPI test
> program is attached and typical output is shown below.
> Hope someone can help us
> Cheers and thanks
> Ole Nielsen
>
> -------------------- Test output across two nodes (This one hangs) --------------------------
> nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17,node18 --npernode 2 a.out
> Number of processes = 4
> Test repeated 3 times for reliability
> I am process 1 on node node17
> P1: Waiting to receive from to P0
> I am process 0 on node node17
> Run 1 of 3
> P0: Sending to P1
> I am process 2 on node node18
> P2: Waiting to receive from to P1
> I am process 3 on node node18
> P3: Waiting to receive from to P2
> P1: Sending to to P2
>
>
> -------------------- Test output within one node (This one is OK) --------------------------
> nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17 --npernode 4 a.out
> Number of processes = 4
> Test repeated 3 times for reliability
> I am process 2 on node node17
> P2: Waiting to receive from to P1
> I am process 0 on node node17
> Run 1 of 3
> P0: Sending to P1
> I am process 1 on node node17
> P1: Waiting to receive from to P0
> I am process 3 on node node17
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Waiting to receive from to P0
> P2: Waiting to receive from to P1
> P3: Sending to to P0
> P0: Received from to P3
> Run 2 of 3
> P0: Sending to P1
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Waiting to receive from to P0
> P3: Sending to to P0
> P2: Waiting to receive from to P1
> P0: Received from to P3
> Run 3 of 3
> P0: Sending to P1
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Done
> P2: Done
> P3: Sending to to P0
> P0: Received from to P3
> P0: Done
> P3: Done
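For reference, the quoted output suggests a simple ring pass (P0 -> P1 -> P2 -> P3 -> P0, repeated three times). The actual test program is mpi_test.c at the URL quoted above; a rough sketch consistent with that output (the message size and data type here are guesses) would look something like this:

/* Rough reconstruction of the ring-pass pattern suggested by the quoted
 * output; the real test program is mpi_test.c at the URL above.  The
 * message size M and the data type are guesses. */
#include <mpi.h>
#include <stdio.h>

#define M 1000
#define REPEATS 3

int main(int argc, char *argv[])
{
    double buf[M];
    char node[MPI_MAX_PROCESSOR_NAME];
    int rank, nproc, len, run, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Get_processor_name(node, &len);

    if (rank == 0) {
        printf("Number of processes = %d\n", nproc);
        printf("Test repeated %d times for reliability\n", REPEATS);
    }
    printf("I am process %d on node %s\n", rank, node);

    for (i = 0; i < M; i++)
        buf[i] = (double) rank;

    if (nproc < 2) {          /* a ring needs at least two processes */
        MPI_Finalize();
        return 0;
    }

    for (run = 0; run < REPEATS; run++) {
        if (rank == 0) {
            /* P0 starts each round: send to P1, then wait for the last rank. */
            printf("P%d: Sending to P%d\n", rank, 1);
            MPI_Send(buf, M, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            printf("P%d: Waiting to receive from P%d\n", rank, nproc - 1);
            MPI_Recv(buf, M, MPI_DOUBLE, nproc - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("P%d: Received from P%d\n", rank, nproc - 1);
        } else {
            /* Everyone else: receive from the left neighbour, send to the right. */
            printf("P%d: Waiting to receive from P%d\n", rank, rank - 1);
            MPI_Recv(buf, M, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("P%d: Sending to P%d\n", rank, (rank + 1) % nproc);
            MPI_Send(buf, M, MPI_DOUBLE, (rank + 1) % nproc, 0, MPI_COMM_WORLD);
        }
    }

    printf("P%d: Done\n", rank);
    MPI_Finalize();
    return 0;
}

In a pattern like this every rank except P0 blocks in a receive until its left-hand neighbour has sent, so if a single inter-node message never completes the remaining ranks all sit waiting while the runtime polls, which would be consistent with the 100% CPU usage and the hang reported above.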