Re: [OMPI users] MPI hangs on multiple nodes

2011-09-25 Thread Ole Nielsen
> The differences I see between guillimin and colosse are:
>
> - Open-MPI 1.4.3 (colosse) vs. MVAPICH2 1.6 (guillimin)
> - Mellanox (colosse) vs. QLogic (guillimin)
>
> Has anyone experienced such high latency with Open-MPI 1.4.3 on Mellanox HCAs?
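
A quick way to compare the two setups is a bare ping-pong microbenchmark. The sketch below is not from the original post (the message size, iteration count, and output format are arbitrary choices), but it should build with mpicc on either cluster and report the average small-message latency between two ranks placed on different nodes.

/* Minimal ping-pong latency sketch (not from the original thread).
 * Rank 0 and rank 1 bounce a small message back and forth and report
 * the average one-way latency.  Run with: mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 10000;
    char buf[8] = {0};                      /* small message -> latency-bound */
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg one-way latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * iters));
    MPI_Finalize();
    return 0;
}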

Re: [OMPI users] MPI hangs on multiple nodes

2011-09-20 Thread Gus Correa
Ole Nielsen wrote: Thanks for your suggestion, Gus; we need a way of debugging what is going on. I am pretty sure the problem lies with our cluster configuration. I know MPI simply relies on the underlying network. However, we can ping and ssh to all nodes (and between any pair as well), so it

Re: [OMPI users] MPI hangs on multiple nodes

2011-09-20 Thread Rolf vandeVaart
> 1: After a reboot of two nodes I ran again, and the inter-node freeze didn't happen until the third iteration. I take that to mean that the basic communication works, but that something is saturating. Is there some notion of buffer size somewhere in the MPI system that could explain this?
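
Open MPI does have such a notion: each transport has an eager limit (MCA parameters along the lines of btl_openib_eager_limit and btl_tcp_eager_limit). Messages below the limit are copied into internal buffers and MPI_Send can return immediately; larger messages use a rendezvous protocol and block until the matching receive is posted. The following is a generic illustration of that dependence rather than the poster's test program: with a small COUNT both ranks finish, while with a large COUNT the head-to-head blocking sends can deadlock.

/* Sketch (not the poster's program) of why internal buffer size matters.
 * Both ranks send first.  Small messages fit in the eager buffers and the
 * sends return at once; large messages wait for the matching receive, so
 * the two MPI_Send calls block each other.  Run with exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT (1 << 20)   /* 4 MB of ints: well above typical eager limits */

int main(int argc, char **argv)
{
    int rank, other;
    int *sendbuf = malloc(COUNT * sizeof(int));
    int *recvbuf = malloc(COUNT * sizeof(int));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;

    /* Reordering send/recv on one rank, or using MPI_Sendrecv or
     * nonblocking calls, removes the reliance on internal buffering. */
    MPI_Send(sendbuf, COUNT, MPI_INT, other, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, COUNT, MPI_INT, other, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

    printf("rank %d done\n", rank);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}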

Re: [OMPI users] MPI hangs on multiple nodes

2011-09-20 Thread Jeff Squyres
On Sep 19, 2011, at 10:23 PM, Ole Nielsen wrote:
> Hi all - and sorry for the multiple postings, but I have more information.

+1 on Eugene's comments. The test program looks fine to me. FWIW, you don't need -lmpi to compile your program; OMPI's wrapper compiler allows you to just: mpicc m

Re: [OMPI users] MPI hangs on multiple nodes

2011-09-19 Thread devendra rai
Hello Ole, I ran your program on open-mpi-1.4.2 five times, and all five times it finished successfully. So I think the problem was with the MPI version. Output from your program is attached. I ran on 3 nodes: $home/OpenMPI-1.4.2/bin/mpirun -np 3 -v --output-filename mpi_testfile ./mpi_t

Re: [OMPI users] MPI hangs on multiple nodes

2011-09-19 Thread Ole Nielsen
Further to the posting below, I can report that the test program (attached - this time correctly) is chewing up CPU time on both compute nodes for as long as I care to let it continue. It would appear that it is stuck in MPI_Recv, which is the next call after the print statements in the test program. Has an
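
The attachment itself is not reproduced in this digest, so the following is only a guessed-at stand-in with invented names and sizes. It mirrors the shape described in the thread: each rank prints a line and then posts MPI_Recv, so output that stops right after the prints while the processes keep burning CPU points at a receive that is never matched (Open MPI busy-polls while waiting, which would explain the full CPU load on both nodes).

/* Hypothetical stand-in for the attached test program (the real
 * attachment is not part of this digest).  Rank 0 exchanges a block of
 * data with every other rank; each rank prints before posting its
 * MPI_Recv so a hang can be located from the output. */
#include <mpi.h>
#include <stdio.h>

#define MSG_INTS 100000   /* arbitrary message size */
#define ITERS    10

static int buf[MSG_INTS];

int main(int argc, char **argv)
{
    int rank, size, iter, p;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (iter = 0; iter < ITERS; iter++) {
        if (rank == 0) {
            for (p = 1; p < size; p++) {
                MPI_Send(buf, MSG_INTS, MPI_INT, p, iter, MPI_COMM_WORLD);
                printf("P0: iteration %d, waiting for reply from P%d\n",
                       iter, p);
                fflush(stdout);
                MPI_Recv(buf, MSG_INTS, MPI_INT, p, iter, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
        } else {
            printf("P%d: iteration %d, waiting for data from P0\n",
                   rank, iter);
            fflush(stdout);
            MPI_Recv(buf, MSG_INTS, MPI_INT, 0, iter, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_INTS, MPI_INT, 0, iter, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}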