Re: [OMPI users] MPI hangs on multiple nodes

2011-09-25 Thread Ole Nielsen
> /usr/mpi/intel/openmpi-1.4.3/lib64/libopen-rte.so.0 (0x2b6e7cb9c000) > libopen-pal.so.0 => /usr/mpi/intel/openmpi-1.4.3/lib64/libopen-pal.so.0 (0x2b6e7ce01000) > libdl.so.2 => /lib64/libdl.so.2 (0x2b6e7d077000) > libnsl.so.1 => /lib64/libnsl.so.1 (0x2b

Re: [OMPI users] MPI hangs on multiple nodes

2011-09-20 Thread Gus Correa
Ole Nielsen wrote: Thanks for your suggestion Gus, we need a way of debugging what is going on. I am pretty sure the problem lies with our cluster configuration. I know MPI simply relies on the underlying network. However, we can ping and ssh to all nodes (and in between any pair as well) so it

Re: [OMPI users] MPI hangs on multiple nodes

2011-09-20 Thread Rolf vandeVaart
>> 1: After a reboot of two nodes I ran again, and the inter-node freeze didn't happen until the third iteration. I take that to mean that the basic communication works, but that something is saturating. Is there some notion of buffer size somewhere in the MPI system that could explain this?

Re: [OMPI users] MPI hangs on multiple nodes

2011-09-20 Thread Jeff Squyres
On Sep 19, 2011, at 10:23 PM, Ole Nielsen wrote: > Hi all - and sorry for the multiple postings, but I have more information. +1 on Eugene's comments. The test program looks fine to me. FWIW, you don't need -lmpi to compile your program; OMPI's wrapper compiler allows you to just: mpicc m

[OMPI users] MPI hangs on multiple nodes

2011-09-19 Thread Ole Nielsen
Hi all - and sorry for the multiple postings, but I have more information. 1: After a reboot of two nodes I ran again, and the inter-node freeze didn't happen until the third iteration. I take that to mean that the basic communication works, but that something is saturating. Is there some notion o

[OMPI users] MPI hangs on multiple nodes

2011-09-19 Thread Ole Nielsen
Thanks for your suggestion Gus, we need a way of debugging what is going on. I am pretty sure the problem lies with our cluster configuration. I know MPI simply relies on the underlying network. However, we can ping and ssh to all nodes (and in between any pair as well) so it is currently a mystery

Re: [OMPI users] MPI hangs on multiple nodes

2011-09-19 Thread devendra rai
./mpi_test So, maybe this helps you. Best, Devendra Rai From: Ole Nielsen To: us...@open-mpi.org Sent: Monday, 19 September 2011, 10:59 Subject: [OMPI users] MPI hangs on multiple nodes The test program is available here: http://code.google.com/p/pypar/source

[OMPI users] MPI hangs on multiple nodes

2011-09-19 Thread Ole Nielsen
The test program is available here: http://code.google.com/p/pypar/source/browse/source/mpi_test.c Hopefully, someone can help us troubleshoot why communications stop when multiple nodes are involved and CPU usage goes to 100% for as long as we leave the program running. Many thanks Ole Nielsen

Re: [OMPI users] MPI hangs on multiple nodes

2011-09-19 Thread Ole Nielsen
Further to the posting below, I can report that the test program (attached - this time correctly) is chewing up CPU time on both compute nodes for as long as I care to let it continue. It would appear that MPI_Receive which is the next command after the print statements in the test program. Has an

[OMPI users] MPI hangs on multiple nodes

2011-09-19 Thread Ole Nielsen
Hi all We have been using Open MPI for many years with Ubuntu on our 20-node cluster. Each node has 2 quad cores, so we usually run up to 8 processes on each node up to a maximum of 160 processes. However, we just upgraded the cluster to Ubuntu 11.04 with Open MPI 1.4.3 and have come across a