On Mon, Apr 22, 2013 at 03:17:16PM -0700, Mike Clark wrote:
> Hi,
> 
> I am trying to run Open MPI on the Cray XK7 system at Oak Ridge National Lab
> (Titan), and am running into an issue whereby MPI_Init seems to hang
> indefinitely. The issue only arises at large scale, e.g., when running
> on 18560 compute nodes (with two MPI processes per node). The application
> runs successfully on 4600 nodes, and we are currently testing a 9000-node
> job to see whether it fails or runs.
> 
> We are launching our job using something like the following
> 
> # mpirun command
> mpicmd="$OMP_DIR/bin/mpirun --prefix $OMP_DIR -np 37120 --npernode 2 \
>     --bind-to core --bind-to numa $app $args"
> # Print and run the command
> echo $mpicmd
> $mpicmd >& $output
> 
> Are there any issues I should be aware of when running Open MPI with 37120
> processes, or when running over the Cray Gemini interconnect?

We have only tested Open MPI up to 131072 ranks on 8192 nodes. Have you tried
attaching DDT to the job to see where it is hanging?
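
If getting DDT onto 18560 nodes is awkward, a lighter-weight first step is to
pull stack traces from a few of the hung ranks by hand. A minimal sketch,
assuming you can reach a compute node while the job is stuck, that gdb is
available there, and reusing the $app variable from your launch script:

    # On one node of the hung job: find the application processes and
    # dump a full backtrace from each one into a per-PID file.
    for pid in $(pgrep -f "$app"); do
        gdb --batch -p "$pid" -ex 'thread apply all bt' > bt.$pid.txt 2>&1
    done

If every rank shows the same stack inside MPI_Init, that at least tells us
which phase of startup is stuck.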

I have a Titan account so I can help with debugging. I would like to get this 
issue fixed in 1.7.2.
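
In the meantime, it may also be worth rerunning with some startup verbosity
enabled so we can see how far the launch gets. A sketch of the kind of
invocation I have in mind (treat the MCA verbosity setting as an example
rather than a recipe; the most useful parameters on 1.7 may differ):

    $OMP_DIR/bin/mpirun --prefix $OMP_DIR -np 37120 --npernode 2 \
        --debug-daemons --mca plm_base_verbose 5 $app $args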

-Nathan
