Please unsubscribe me from this mailing list. Thank you,
-Bill Lane

________________________________
From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Ole Nielsen [ole.moller.niel...@gmail.com]
Sent: Monday, September 19, 2011 1:39 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] MPI hangs on multiple nodes

Further to the posting below, I can report that the test program (attached - this time correctly) is chewing up CPU time on both compute nodes for as long as I care to let it continue. It would appear that the processes are blocked in MPI_Receive, which is the next command after the print statements in the test program. Has anyone else seen this behavior, or can anyone give me a hint on how to troubleshoot it?

Cheers and thanks
Ole Nielsen

Output:
nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17,node18 --npernode 2 a.out
Number of processes = 4
Test repeated 3 times for reliability
I am process 2 on node node18
P2: Waiting to receive from to P1
I am process 0 on node node17
Run 1 of 3
P0: Sending to P1
I am process 1 on node node17
P1: Waiting to receive from to P0
I am process 3 on node node18
P3: Waiting to receive from to P2
P0: Waiting to receive from P3
P1: Sending to to P2
P1: Waiting to receive from to P0
P2: Sending to to P3
P0: Received from to P3
Run 2 of 3
P0: Sending to P1
P3: Sending to to P0
P3: Waiting to receive from to P2
P2: Waiting to receive from to P1
P1: Sending to to P2
P0: Waiting to receive from P3

On Mon, Sep 19, 2011 at 11:04 AM, Ole Nielsen <ole.moller.niel...@gmail.com> wrote:

Hi all

We have been using OpenMPI for many years with Ubuntu on our 20-node cluster. Each node has two quad-core CPUs, so we usually run up to 8 processes per node, up to a maximum of 160 processes. However, we just upgraded the cluster to Ubuntu 11.04 with Open MPI 1.4.3 and have come across a strange behavior where MPI programs run perfectly well when confined to one node but hang during communication across multiple nodes.
We have no idea why and would like some help in debugging this. A small MPI test program is attached and typical output is shown below. Hope someone can help us.

Cheers and thanks
Ole Nielsen

-------------------- Test output across two nodes (this one hangs) --------------------
nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17,node18 --npernode 2 a.out
Number of processes = 4
Test repeated 3 times for reliability
I am process 1 on node node17
P1: Waiting to receive from to P0
I am process 0 on node node17
Run 1 of 3
P0: Sending to P1
I am process 2 on node node18
P2: Waiting to receive from to P1
I am process 3 on node node18
P3: Waiting to receive from to P2
P1: Sending to to P2

-------------------- Test output within one node (this one is OK) --------------------
nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17 --npernode 4 a.out
Number of processes = 4
Test repeated 3 times for reliability
I am process 2 on node node17
P2: Waiting to receive from to P1
I am process 0 on node node17
Run 1 of 3
P0: Sending to P1
I am process 1 on node node17
P1: Waiting to receive from to P0
I am process 3 on node node17
P3: Waiting to receive from to P2
P1: Sending to to P2
P2: Sending to to P3
P1: Waiting to receive from to P0
P2: Waiting to receive from to P1
P3: Sending to to P0
P0: Received from to P3
Run 2 of 3
P0: Sending to P1
P3: Waiting to receive from to P2
P1: Sending to to P2
P2: Sending to to P3
P1: Waiting to receive from to P0
P3: Sending to to P0
P2: Waiting to receive from to P1
P0: Received from to P3
Run 3 of 3
P0: Sending to P1
P3: Waiting to receive from to P2
P1: Sending to to P2
P2: Sending to to P3
P1: Done
P2: Done
P3: Sending to to P0
P0: Received from to P3
P0: Done
P3: Done

IMPORTANT WARNING: This message is intended for the use of the person or entity to which it is addressed and may contain information that is privileged and confidential, the disclosure of which is governed
by applicable law. If the reader of this message is not the intended recipient, or the employee or agent responsible for delivering it to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this information is STRICTLY PROHIBITED. If you have received this message in error, please notify us immediately by calling (310) 423-6428 and destroy the related message. Thank You for your cooperation.
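[The attached test program did not survive in the archive. The following is a minimal sketch of a ring test consistent with the output quoted above; the message tag, buffer contents, and variable names are assumptions, not the original code.]

```c
/* Sketch of the ring test: P0 -> P1 -> ... -> P(n-1) -> P0, repeated
 * NUM_RUNS times. Reconstructed from the printed output in this thread;
 * details (tag 0, single-int payload) are assumptions. */
#include <stdio.h>
#include <mpi.h>

#define NUM_RUNS 3

int main(int argc, char *argv[])
{
    int rank, size, run, buf = 0, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &name_len);

    if (rank == 0) {
        printf("Number of processes = %d\n", size);
        printf("Test repeated %d times for reliability\n", NUM_RUNS);
    }
    printf("I am process %d on node %s\n", rank, name);

    int next = (rank + 1) % size;          /* successor in the ring */
    int prev = (rank + size - 1) % size;   /* predecessor in the ring */

    for (run = 1; run <= NUM_RUNS; run++) {
        if (rank == 0) {
            /* P0 starts each round, then waits for the token to come back */
            printf("Run %d of %d\n", run, NUM_RUNS);
            printf("P0: Sending to P%d\n", next);
            MPI_Send(&buf, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
            printf("P0: Waiting to receive from P%d\n", prev);
            MPI_Recv(&buf, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &status);
            printf("P0: Received from P%d\n", prev);
        } else {
            /* Everyone else blocks in MPI_Recv right after printing,
             * which matches where the multi-node run hangs */
            printf("P%d: Waiting to receive from P%d\n", rank, prev);
            MPI_Recv(&buf, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &status);
            printf("P%d: Sending to P%d\n", rank, next);
            MPI_Send(&buf, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        }
    }
    printf("P%d: Done\n", rank);

    MPI_Finalize();
    return 0;
}
```

Build and launch as in the thread (`mpicc ring.c && mpirun --hostfile /etc/mpihosts --host node17,node18 --npernode 2 a.out`). Since the single-node run succeeds and the cross-node run blocks in the first inter-node receive, the usual suspect is the TCP path between nodes; running with Open MPI's verbosity parameter, e.g. `mpirun --mca btl_base_verbose 30 ...`, reports which interfaces the TCP BTL selects and can expose a firewall or wrong-interface problem.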