Hello Ole

I ran your program with Open MPI 1.4.2 five times, and all five times it 
finished successfully.

So I think the problem lies with the Open MPI version.

Output from your program is attached. I ran on 3 nodes:

$HOME/OpenMPI-1.4.2/bin/mpirun -np 3 -v --output-filename mpi_testfile ./mpi_test
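
(The --output-filename option makes mpirun write each rank's output to its own 
file, mpi_testfile.<rank>; that is where the attached mpi_testfile.0, 
mpi_testfile.1, and mpi_testfile.2 come from.)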

Hope this helps.
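
If the hang comes back on 1.4.3, one thing worth trying (just a suggestion; I 
have not tested this on your cluster) is to force the TCP transport and tell 
Open MPI which network interface to use, since multi-node hangs with CPUs stuck 
at 100% are often down to the TCP BTL picking the wrong interface:

    mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 \
           --hostfile /etc/mpihosts --host node17,node18 --npernode 2 ./a.out

(Replace eth0 with whatever interface actually connects your nodes.)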

Best,

Devendra Rai



________________________________
From: Ole Nielsen <ole.moller.niel...@gmail.com>
To: us...@open-mpi.org
Sent: Monday, 19 September 2011, 10:59
Subject: [OMPI users] MPI hangs on multiple nodes


The test program is available here:
http://code.google.com/p/pypar/source/browse/source/mpi_test.c

Hopefully, someone can help us troubleshoot why communication stops when 
multiple nodes are involved while CPU usage stays at 100% for as long as we 
leave the program running.

Many thanks
Ole Nielsen



---------- Forwarded message ----------
From: Ole Nielsen <ole.moller.niel...@gmail.com>
Date: Mon, Sep 19, 2011 at 3:39 PM
Subject: Re: MPI hangs on multiple nodes
To: us...@open-mpi.org


Further to the posting below, I can report that the test program (attached - 
this time correctly) is chewing up CPU time on both compute nodes for as long 
as I care to let it continue.
It would appear that the program hangs in MPI_Recv, which is the next call 
after the print statements in the test program.
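
For readers without the attachment, here is a minimal sketch of the pattern the 
output implies (only an illustration, not the actual mpi_test.c, which is linked 
in my first message): each rank waits to receive from rank-1 and then passes the 
message on to rank+1, with rank 0 starting the ring, repeated three times.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, run, buf = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* assumes at least 2 ranks */

        for (run = 0; run < 3; run++) {
            if (rank == 0) {
                /* Rank 0 kicks off the ring, then waits for it to complete. */
                printf("P0: Sending to P1\n");
                MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
                printf("P0: Waiting to receive from P%d\n", size - 1);
                MPI_Recv(&buf, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                /* Everyone else receives from the left, sends to the right. */
                printf("P%d: Waiting to receive from P%d\n", rank, rank - 1);
                MPI_Recv(&buf, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("P%d: Sending to P%d\n", rank, (rank + 1) % size);
                MPI_Send(&buf, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
            }
        }
        printf("P%d: Done\n", rank);
        MPI_Finalize();
        return 0;
    }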

Has anyone else seen this behavior, or can anyone give me a hint on how to 
troubleshoot it?

Cheers and thanks
Ole Nielsen

Output:

nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host 
node17,node18 --npernode 2 a.out 
Number of processes = 4
Test repeated 3 times for reliability

I am process 2 on node node18
P2: Waiting to receive from to P1

I am process 0 on node node17
Run 1 of 3
P0: Sending to P1
I am process 1 on node node17
P1: Waiting to receive from to P0

I am process 3 on node node18
P3: Waiting to receive from to P2
P0: Waiting to receive from P3

P1: Sending to to P2

P1: Waiting to receive from to P0
P2: Sending to to P3

P0: Received from to P3
Run 2 of 3
P0: Sending to P1
P3: Sending to to P0

P3: Waiting to receive from to P2

P2: Waiting to receive from to P1

P1: Sending to to P2
P0: Waiting to receive from P3

On Mon, Sep 19, 2011 at 11:04 AM, Ole Nielsen <ole.moller.niel...@gmail.com> 
wrote:


>
>Hi all
>
>We have been using Open MPI for many years with Ubuntu on our 20-node cluster. 
>Each node has two quad-core CPUs, so we usually run up to 8 processes per node, 
>up to a maximum of 160 processes.
>
>However, we just upgraded the cluster to Ubuntu 11.04 with Open MPI 1.4.3 
>and have come across a strange behavior where MPI programs run perfectly well 
>when confined to one node but hang during communication across multiple 
>nodes. We have no idea why and would like some help debugging this. A small 
>MPI test program is attached and typical output is shown below.
>
>Hope someone can help us
>Cheers and thanks
>Ole Nielsen
>
>-------------------- Test output across two nodes (This one hangs) 
>--------------------------
>nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host 
>node17,node18 --npernode 2 a.out 
>Number of processes = 4
>Test repeated 3 times for reliability
>I am process 1 on node node17
>P1: Waiting to receive from to P0
>I am process 0 on node node17
>Run 1 of 3
>P0: Sending to P1
>I am process 2 on node node18
>P2: Waiting to receive from to P1
>I am process 3 on node node18
>P3: Waiting to receive from to P2
>P1: Sending to to P2
>
>
>-------------------- Test output within one node (This one is OK) 
>--------------------------
>nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host 
>node17 --npernode 4 a.out 
>Number of processes = 4
>Test repeated 3 times for reliability
>I am process 2 on node node17
>P2: Waiting to receive from to P1
>I am process 0 on node node17
>Run 1 of 3
>P0: Sending to P1
>I am process 1 on node node17
>P1: Waiting to receive from to P0
>I am process 3 on node node17
>P3: Waiting to receive from to P2
>P1: Sending to to P2
>P2: Sending to to P3
>P1: Waiting to receive from to P0
>P2: Waiting to receive from to P1
>P3: Sending to to P0
>P0: Received from to P3
>Run 2 of 3
>P0: Sending to P1
>P3: Waiting to receive from to P2
>P1: Sending to to P2
>P2: Sending to to P3
>P1: Waiting to receive from to P0
>P3: Sending to to P0
>P2: Waiting to receive from to P1
>P0: Received from to P3
>Run 3 of 3
>P0: Sending to P1
>P3: Waiting to receive from to P2
>P1: Sending to to P2
>P2: Sending to to P3
>P1: Done
>P2: Done
>P3: Sending to to P0
>P0: Received from to P3
>P0: Done
>P3: Done
>

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Attachment: mpi_testfile.1
Description: Binary data

Attachment: mpi_testfile.2
Description: Binary data

Attachment: mpi_testfile.0
Description: Binary data
