On Feb 4, 2013, at 10:55 AM, Bharath Ramesh <bram...@vt.edu> wrote:

> I am trying to debug an issue which is really weird. I have a
> simple MPI hello world application (attached) that hangs when I
> try to run on our cluster using 256 nodes with 16 cores on each
> node. The cluster uses QDR IB.
> 
> I am able to run the test over ethernet by excluding openib from
> the btl. However, what is weird is that for the same set of nodes
> xhpl completes without any error using 256 nodes and 16 cores. I
> have tried running the Pallas MPI Benchmark and it also behaves
> similarly to hello world and ends up hanging when I run it using
> 256 nodes.

Sorry for the delay; I was on travel all last week and fell behind.

I'm not sure I can parse your scenario description.  Are you saying:

- hello world over IB hangs at 256*16 procs
- hello world over TCP works at 256*16 procs
- xhpl over TCP works at 256*16 procs
- IMB over ?TCP|IB? hangs at 256*16 procs

> When I attach gdb to the MPI processes and look at the backtrace
> I see that close to ~1000 of the MPI processes are stuck in MPI_Send
> while the others are waiting in MPI_Finalize. I have checked to
> make sure that the ulimit setting for locked memory is unlimited.
> The number of open files per process is 131072. The default MPI
> stack provided is openmpi-1.6.1 on the system. I compiled
> openmpi-1.6.3 in my home directory and the behavior remains
> the same.
> 
> I would appreciate any help in debugging this issue.

Can you try the 1.6.4rc?  http://www.open-mpi.org/software/ompi/v1.6/

> -- 
> Bharath
> <hello_world_mpi.c>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/