The test program is available here: http://code.google.com/p/pypar/source/browse/source/mpi_test.c
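For readers who cannot follow the link, the output below suggests a simple ring-style send/receive test: P0 starts each run by sending to P1, every other rank receives from its predecessor and forwards to its successor, and the last rank sends back to P0, repeated three times. The following is only a rough sketch of such a test, not the actual mpi_test.c; the message size, tag, and exact print wording are assumptions.

/*
 * Sketch of a ring-style MPI send/receive test (illustrative only,
 * NOT the actual mpi_test.c; message size, tag and wording assumed).
 * Requires at least 2 processes.
 */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define NUM_RUNS 3
#define MSG_LEN  64   /* assumed message size */
#define TAG      0

int main(int argc, char *argv[])
{
    int rank, size, run, name_len;
    char msg[MSG_LEN];
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &name_len);

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "Need at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {
        printf("Number of processes = %d\n", size);
        printf("Test repeated %d times for reliability\n", NUM_RUNS);
    }
    printf("I am process %d on node %s\n", rank, name);

    for (run = 1; run <= NUM_RUNS; run++) {
        if (rank == 0) {
            /* Rank 0 starts the ring and waits for it to come back around. */
            printf("Run %d of %d\n", run, NUM_RUNS);
            snprintf(msg, MSG_LEN, "hello from run %d", run);
            printf("P0: Sending to P1\n");
            MPI_Send(msg, MSG_LEN, MPI_CHAR, 1, TAG, MPI_COMM_WORLD);
            printf("P0: Waiting to receive from P%d\n", size - 1);
            MPI_Recv(msg, MSG_LEN, MPI_CHAR, size - 1, TAG, MPI_COMM_WORLD, &status);
            printf("P0: Received from P%d\n", size - 1);
        } else {
            /* Every other rank receives from its predecessor and forwards
               to its successor (the last rank wraps around to rank 0). */
            printf("P%d: Waiting to receive from P%d\n", rank, rank - 1);
            MPI_Recv(msg, MSG_LEN, MPI_CHAR, rank - 1, TAG, MPI_COMM_WORLD, &status);
            printf("P%d: Sending to P%d\n", rank, (rank + 1) % size);
            MPI_Send(msg, MSG_LEN, MPI_CHAR, (rank + 1) % size, TAG, MPI_COMM_WORLD);
        }
    }
    printf("P%d: Done\n", rank);

    MPI_Finalize();
    return 0;
}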
Hopefully, someone can help us troubleshoot why communication stops when multiple nodes are involved and CPU usage goes to 100% for as long as we leave the program running.

Many thanks
Ole Nielsen

---------- Forwarded message ----------
From: Ole Nielsen <ole.moller.niel...@gmail.com>
Date: Mon, Sep 19, 2011 at 3:39 PM
Subject: Re: MPI hangs on multiple nodes
To: us...@open-mpi.org

Further to the posting below, I can report that the test program (attached - this time correctly) is chewing up CPU time on both compute nodes for as long as I care to let it continue. It would appear that the processes are stuck in MPI_Receive, which is the next call after the print statements in the test program. Has anyone else seen this behavior, or can anyone give me a hint on how to troubleshoot it?

Cheers and thanks
Ole Nielsen

Output:
nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17,node18 --npernode 2 a.out
Number of processes = 4
Test repeated 3 times for reliability
I am process 2 on node node18
P2: Waiting to receive from to P1
I am process 0 on node node17
Run 1 of 3
P0: Sending to P1
I am process 1 on node node17
P1: Waiting to receive from to P0
I am process 3 on node node18
P3: Waiting to receive from to P2
P0: Waiting to receive from P3
P1: Sending to to P2
P1: Waiting to receive from to P0
P2: Sending to to P3
P0: Received from to P3
Run 2 of 3
P0: Sending to P1
P3: Sending to to P0
P3: Waiting to receive from to P2
P2: Waiting to receive from to P1
P1: Sending to to P2
P0: Waiting to receive from P3

On Mon, Sep 19, 2011 at 11:04 AM, Ole Nielsen <ole.moller.niel...@gmail.com> wrote:
>
> Hi all
>
> We have been using OpenMPI for many years with Ubuntu on our 20-node
> cluster. Each node has 2 quad cores, so we usually run up to 8 processes on
> each node, up to a maximum of 160 processes.
>
> However, we just upgraded the cluster to Ubuntu 11.04 with Open MPI 1.4.3
> and have come across a strange behavior where MPI programs run perfectly
> well when confined to one node but hang during communication across
> multiple nodes. We have no idea why and would like some help in debugging
> this. A small MPI test program is attached and typical output shown below.
>
> Hope someone can help us
> Cheers and thanks
> Ole Nielsen
>
> -------------------- Test output across two nodes (This one hangs) --------------------------
> nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17,node18 --npernode 2 a.out
> Number of processes = 4
> Test repeated 3 times for reliability
> I am process 1 on node node17
> P1: Waiting to receive from to P0
> I am process 0 on node node17
> Run 1 of 3
> P0: Sending to P1
> I am process 2 on node node18
> P2: Waiting to receive from to P1
> I am process 3 on node node18
> P3: Waiting to receive from to P2
> P1: Sending to to P2
>
>
> -------------------- Test output within one node (This one is OK) --------------------------
> nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17 --npernode 4 a.out
> Number of processes = 4
> Test repeated 3 times for reliability
> I am process 2 on node node17
> P2: Waiting to receive from to P1
> I am process 0 on node node17
> Run 1 of 3
> P0: Sending to P1
> I am process 1 on node node17
> P1: Waiting to receive from to P0
> I am process 3 on node node17
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Waiting to receive from to P0
> P2: Waiting to receive from to P1
> P3: Sending to to P0
> P0: Received from to P3
> Run 2 of 3
> P0: Sending to P1
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Waiting to receive from to P0
> P3: Sending to to P0
> P2: Waiting to receive from to P1
> P0: Received from to P3
> Run 3 of 3
> P0: Sending to P1
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Done
> P2: Done
> P3: Sending to to P0
> P0: Received from to P3
> P0: Done
> P3: Done