On Feb 4, 2013, at 10:55 AM, Bharath Ramesh <bram...@vt.edu> wrote:

> I am trying to debug an issue which is really weird. I have a
> simple MPI hello world application (attached) that hangs when I
> try to run on our cluster using 256 nodes with 16 cores on each
> node. The cluster uses QDR IB.
>
> I am able to run the test over Ethernet by excluding openib from
> the btl. However, what is weird is that for the same set of nodes,
> xhpl completes without any error using 256 nodes and 16 cores. I
> have tried running the Pallas MPI Benchmark and it also behaves
> similarly to hello world, and ends up hanging when I run it using
> 256 nodes.
Sorry for the delay; I was on travel all last week and fell behind.

I'm not sure I can parse your scenario description. Are you saying:

- hello world over IB hangs at 256*16 procs
- hello world over TCP works at 256*16 procs
- xhpl over TCP works at 256*16 procs
- IMB over ?TCP|IB? hangs at 256*16 procs

> When I attach gdb to the MPI processes and look at the backtrace,
> I see that close to ~1000 of the MPI processes are stuck in
> MPI_Send while the others are waiting in MPI_Finalize. I have
> checked to make sure that the ulimit setting for locked memory is
> unlimited. The number of open files per process is 131072. The
> default MPI stack provided on the system is openmpi-1.6.1. I
> compiled openmpi-1.6.3 in my home directory and the behavior
> remains the same.
>
> I would appreciate any help in debugging this issue.

Can you try the 1.6.4rc?

http://www.open-mpi.org/software/ompi/v1.6/

> --
> Bharath
> <hello_world_mpi.c>

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/