Hello Ole, I ran your program on open-mpi-1.4.2 five times, and all five times it finished successfully.
So, I think the problem was with the version of MPI. Output from your program is attached. I ran on 3 nodes:

$home/OpenMPI-1.4.2/bin/mpirun -np 3 -v --output-filename mpi_testfile ./mpi_test

So, maybe this helps you.

Best,

Devendra Rai

________________________________
From: Ole Nielsen <ole.moller.niel...@gmail.com>
To: us...@open-mpi.org
Sent: Monday, 19 September 2011, 10:59
Subject: [OMPI users] MPI hangs on multiple nodes

The test program is available here:
http://code.google.com/p/pypar/source/browse/source/mpi_test.c

Hopefully, someone can help us troubleshoot why communication stops when multiple nodes are involved and CPU usage stays at 100% for as long as we leave the program running.

Many thanks
Ole Nielsen

---------- Forwarded message ----------
From: Ole Nielsen <ole.moller.niel...@gmail.com>
Date: Mon, Sep 19, 2011 at 3:39 PM
Subject: Re: MPI hangs on multiple nodes
To: us...@open-mpi.org

Further to the posting below, I can report that the test program (attached - this time correctly) keeps chewing up CPU time on both compute nodes for as long as I care to let it continue. It would appear that it hangs in MPI_Recv, which is the next call after the print statements in the test program. Has anyone else seen this behavior, or can anyone give me a hint on how to troubleshoot it?

Cheers and thanks
Ole Nielsen

Output:
nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17,node18 --npernode 2 a.out
Number of processes = 4
Test repeated 3 times for reliability
I am process 2 on node node18
P2: Waiting to receive from to P1
I am process 0 on node node17
Run 1 of 3
P0: Sending to P1
I am process 1 on node node17
P1: Waiting to receive from to P0
I am process 3 on node node18
P3: Waiting to receive from to P2
P0: Waiting to receive from P3
P1: Sending to to P2
P1: Waiting to receive from to P0
P2: Sending to to P3
P0: Received from to P3
Run 2 of 3
P0: Sending to P1
P3: Sending to to P0
P3: Waiting to receive from to P2
P2: Waiting to receive from to P1
P1: Sending to to P2
P0: Waiting to receive from P3

On Mon, Sep 19, 2011 at 11:04 AM, Ole Nielsen <ole.moller.niel...@gmail.com> wrote:
>
> Hi all
>
> We have been using OpenMPI for many years with Ubuntu on our 20-node cluster.
> Each node has 2 quad cores, so we usually run up to 8 processes on each node,
> up to a maximum of 160 processes.
>
> However, we just upgraded the cluster to Ubuntu 11.04 with Open MPI 1.4.3 and
> have come across a strange behavior where MPI programs run perfectly well
> when confined to one node but hang during communication across multiple
> nodes. We have no idea why and would like some help in debugging this. A small
> MPI test program is attached and typical output is shown below.
>
> Hope someone can help us
>
> Cheers and thanks
> Ole Nielsen
>
> -------------------- Test output across two nodes (This one hangs) --------------------------
> nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17,node18 --npernode 2 a.out
> Number of processes = 4
> Test repeated 3 times for reliability
> I am process 1 on node node17
> P1: Waiting to receive from to P0
> I am process 0 on node node17
> Run 1 of 3
> P0: Sending to P1
> I am process 2 on node node18
> P2: Waiting to receive from to P1
> I am process 3 on node node18
> P3: Waiting to receive from to P2
> P1: Sending to to P2
>
> -------------------- Test output within one node (This one is OK) --------------------------
> nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17 --npernode 4 a.out
> Number of processes = 4
> Test repeated 3 times for reliability
> I am process 2 on node node17
> P2: Waiting to receive from to P1
> I am process 0 on node node17
> Run 1 of 3
> P0: Sending to P1
> I am process 1 on node node17
> P1: Waiting to receive from to P0
> I am process 3 on node node17
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Waiting to receive from to P0
> P2: Waiting to receive from to P1
> P3: Sending to to P0
> P0: Received from to P3
> Run 2 of 3
> P0: Sending to P1
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Waiting to receive from to P0
> P3: Sending to to P0
> P2: Waiting to receive from to P1
> P0: Received from to P3
> Run 3 of 3
> P0: Sending to P1
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Done
> P2: Done
> P3: Sending to to P0
> P0: Received from to P3
> P0: Done
> P3: Done

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
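[Editor's note: for readers without the attachment, the ring-style exchange suggested by the output above can be sketched roughly as follows. This is only an approximation reconstructed from the printed messages; the actual mpi_test.c is the one linked in the pypar repository above, and the message length (N), tag, and exact wording of the prints here are assumptions, not the original code.]

/* Approximate sketch of a ring test: P0 sends to P1, each rank forwards
 * to its right-hand neighbour, and P0 finally receives from the last rank.
 * The whole ring pass is repeated REPEATS times. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define N 1000        /* assumed message length */
#define REPEATS 3     /* matches "Test repeated 3 times" in the output */

int main(int argc, char **argv)
{
    int rank, size, next, prev, run, namelen;
    char buf[N];
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(node, &namelen);
    memset(buf, 0, N);

    next = (rank + 1) % size;          /* neighbour we send to       */
    prev = (rank - 1 + size) % size;   /* neighbour we receive from  */

    if (rank == 0) {
        printf("Number of processes = %d\n", size);
        printf("Test repeated %d times for reliability\n", REPEATS);
    }
    printf("I am process %d on node %s\n", rank, node);

    for (run = 1; run <= REPEATS; run++) {
        if (rank == 0) {
            printf("Run %d of %d\n", run, REPEATS);
            printf("P0: Sending to P%d\n", next);
            MPI_Send(buf, N, MPI_CHAR, next, 0, MPI_COMM_WORLD);
            printf("P0: Waiting to receive from P%d\n", prev);
            MPI_Recv(buf, N, MPI_CHAR, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("P0: Received from P%d\n", prev);
        } else {
            printf("P%d: Waiting to receive from P%d\n", rank, prev);
            MPI_Recv(buf, N, MPI_CHAR, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("P%d: Sending to P%d\n", rank, next);
            MPI_Send(buf, N, MPI_CHAR, next, 0, MPI_COMM_WORLD);
        }
    }
    printf("P%d: Done\n", rank);

    MPI_Finalize();
    return 0;
}

In this sketch the hang reported above would show up as every rank sitting in MPI_Recv (100% CPU with busy polling) right after printing its "Waiting to receive" line, which matches the two-node output where no messages get past the first inter-node send.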
mpi_testfile.1
Description: Binary data
mpi_testfile.2
Description: Binary data
mpi_testfile.0
Description: Binary data