Hi Ole

You could try the examples/connectivity.c program in the
OpenMPI source tree, to test if everything is alright.
It also hints at how to solve the buffer re-use issue
that Sebastien [rightfully] pointed out [i.e., declare separate
buffers for MPI_Send and MPI_Recv].
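
For illustration only (this is not code from connectivity.c, just a minimal
sketch of the idea): each rank keeps the outgoing and incoming data in two
distinct arrays, for instance with MPI_Sendrecv around a ring:

    #include <mpi.h>
    #include <stdio.h>

    #define MSGSIZE 1000

    int main(int argc, char *argv[])
    {
        int rank, size, tag = 0;
        double sendbuf[MSGSIZE];   /* outgoing data only */
        double recvbuf[MSGSIZE];   /* incoming data only -- no re-use of the send buffer */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int next = (rank + 1) % size;
        int prev = (rank - 1 + size) % size;

        for (int i = 0; i < MSGSIZE; i++)
            sendbuf[i] = (double) rank;

        /* Distinct buffers, and a combined send/receive so neither
           direction has to wait for the other to finish first. */
        MPI_Sendrecv(sendbuf, MSGSIZE, MPI_DOUBLE, next, tag,
                     recvbuf, MSGSIZE, MPI_DOUBLE, prev, tag,
                     MPI_COMM_WORLD, &status);

        printf("P%d: received a message from P%d\n", rank, prev);

        MPI_Finalize();
        return 0;
    }

MPI_Sendrecv is only one option; plain MPI_Send/MPI_Recv with two distinct
arrays works just as well. The point is simply that the send and the receive
never share storage.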

Gus Correa

Sébastien Boisvert wrote:
Hello,

Is it safe to re-use the same buffer (variable A) for MPI_Send and MPI_Recv, given that MPI_Send may be eager depending on the MCA parameters?
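
Concretely, I mean a pattern like this hypothetical, stripped-down sketch (not
our actual code), where the same array A is handed to both calls:

    #include <mpi.h>

    #define N 100000   /* arbitrary size for the sketch */

    int main(int argc, char *argv[])
    {
        int rank, size, tag = 0;
        static double A[N];        /* one buffer, re-used for send and receive */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int next = (rank + 1) % size;
        int prev = (rank - 1 + size) % size;

        /* Every rank sends out of A and then receives back into A.
           Below the eager limit, MPI_Send returns once the data has been
           copied away; above it, the send falls back to rendezvous and
           blocks until the matching receive has been posted. */
        MPI_Send(A, N, MPI_DOUBLE, next, tag, MPI_COMM_WORLD);
        MPI_Recv(A, N, MPI_DOUBLE, prev, tag, MPI_COMM_WORLD, &status);

        MPI_Finalize();
        return 0;
    }

The array size above is arbitrary; in practice it determines which protocol
the send uses.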

Sébastien
________________________________________
From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Ole 
Nielsen [ole.moller.niel...@gmail.com]
Sent: 19 September 2011 04:59
To: us...@open-mpi.org
Subject: [OMPI users] MPI hangs on multiple nodes

The test program is available here:
http://code.google.com/p/pypar/source/browse/source/mpi_test.c

Hopefully, someone can help us troubleshoot why communication stops when 
multiple nodes are involved and CPU usage stays at 100% for as long as we leave 
the program running.

Many thanks
Ole Nielsen


---------- Forwarded message ----------
From: Ole Nielsen 
<ole.moller.niel...@gmail.com>
Date: Mon, Sep 19, 2011 at 3:39 PM
Subject: Re: MPI hangs on multiple nodes
To: us...@open-mpi.org


Further to the posting below, I can report that the test program (attached - 
this time correctly) is chewing up CPU time on both compute nodes for as long 
as I care to let it continue.
It would appear that the processes are stuck in MPI_Recv, which is the next 
call after the print statements in the test program.
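
For reference, here is a rough sketch of the kind of send/receive ring that
produces the messages shown in the output (the real code is mpi_test.c linked
above; the message size and variable names here are only placeholders):

    #include <mpi.h>
    #include <stdio.h>

    #define MSGSIZE 100000   /* placeholder; the real size is in mpi_test.c */
    #define REPEATS 3

    int main(int argc, char *argv[])
    {
        int rank, size, tag = 0, namelen;
        static double buf[MSGSIZE];
        char node[MPI_MAX_PROCESSOR_NAME];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(node, &namelen);

        if (rank == 0) {
            printf("Number of processes = %d\n", size);
            printf("Test repeated %d times for reliability\n", REPEATS);
        }
        printf("I am process %d on node %s\n", rank, node);

        for (int run = 1; run <= REPEATS; run++) {
            if (rank == 0) {
                printf("Run %d of %d\n", run, REPEATS);
                printf("P0: Sending to P1\n");
                MPI_Send(buf, MSGSIZE, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
                printf("P0: Waiting to receive from P%d\n", size - 1);
                MPI_Recv(buf, MSGSIZE, MPI_DOUBLE, size - 1, tag,
                         MPI_COMM_WORLD, &status);
                printf("P0: Received from P%d\n", size - 1);
            } else {
                printf("P%d: Waiting to receive from P%d\n", rank, rank - 1);
                /* This is the MPI_Recv the ranks seem to be sitting in
                   when the job spans more than one node. */
                MPI_Recv(buf, MSGSIZE, MPI_DOUBLE, rank - 1, tag,
                         MPI_COMM_WORLD, &status);
                printf("P%d: Sending to P%d\n", rank, (rank + 1) % size);
                MPI_Send(buf, MSGSIZE, MPI_DOUBLE, (rank + 1) % size, tag,
                         MPI_COMM_WORLD);
            }
        }
        printf("P%d: Done\n", rank);

        MPI_Finalize();
        return 0;
    }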

Has anyone else seen this behavior, or can anyone give me a hint on how to 
troubleshoot it?

Cheers and thanks
Ole Nielsen

Output:

nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17,node18 --npernode 2 a.out
Number of processes = 4
Test repeated 3 times for reliability
I am process 2 on node node18
P2: Waiting to receive from to P1
I am process 0 on node node17
Run 1 of 3
P0: Sending to P1
I am process 1 on node node17
P1: Waiting to receive from to P0
I am process 3 on node node18
P3: Waiting to receive from to P2
P0: Waiting to receive from P3

P1: Sending to to P2
P1: Waiting to receive from to P0
P2: Sending to to P3

P0: Received from to P3
Run 2 of 3
P0: Sending to P1
P3: Sending to to P0

P3: Waiting to receive from to P2
P2: Waiting to receive from to P1
P1: Sending to to P2
P0: Waiting to receive from P3

On Mon, Sep 19, 2011 at 11:04 AM, Ole Nielsen 
<ole.moller.niel...@gmail.com> wrote:

Hi all

We have been using OpenMPI for many years with Ubuntu on our 20-node cluster. 
Each node has two quad-core CPUs, so we usually run up to 8 processes per node, 
for a maximum of 160 processes.

However, we just upgraded the cluster to Ubuntu 11.04 with Open MPI 1.4.3 and 
have come across a strange behavior where MPI programs run perfectly well when 
confined to one node but hang during communication across multiple nodes. 
We have no idea why and would like some help in debugging this. A small MPI 
test program is attached and typical output is shown below.

Hope someone can help us
Cheers and thanks
Ole Nielsen

-------------------- Test output across two nodes (This one hangs) --------------------------
nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17,node18 --npernode 2 a.out
Number of processes = 4
Test repeated 3 times for reliability
I am process 1 on node node17
P1: Waiting to receive from to P0
I am process 0 on node node17
Run 1 of 3
P0: Sending to P1
I am process 2 on node node18
P2: Waiting to receive from to P1
I am process 3 on node node18
P3: Waiting to receive from to P2
P1: Sending to to P2


-------------------- Test output within one node (This one is OK) --------------------------
nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17 --npernode 4 a.out
Number of processes = 4
Test repeated 3 times for reliability
I am process 2 on node node17
P2: Waiting to receive from to P1
I am process 0 on node node17
Run 1 of 3
P0: Sending to P1
I am process 1 on node node17
P1: Waiting to receive from to P0
I am process 3 on node node17
P3: Waiting to receive from to P2
P1: Sending to to P2
P2: Sending to to P3
P1: Waiting to receive from to P0
P2: Waiting to receive from to P1
P3: Sending to to P0
P0: Received from to P3
Run 2 of 3
P0: Sending to P1
P3: Waiting to receive from to P2
P1: Sending to to P2
P2: Sending to to P3
P1: Waiting to receive from to P0
P3: Sending to to P0
P2: Waiting to receive from to P1
P0: Received from to P3
Run 3 of 3
P0: Sending to P1
P3: Waiting to receive from to P2
P1: Sending to to P2
P2: Sending to to P3
P1: Done
P2: Done
P3: Sending to to P0
P0: Received from to P3
P0: Done
P3: Done

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
