I'm trying to diagnose an MPI job (in this case xhpl) that fails to start when the rank count gets fairly high, into the thousands.
The symptom is that the job fires up via Slurm, and I can see all the xhpl processes on the nodes, but it never kicks over to the next process. My question is: what debugging output should I turn on to tell me what the system might be waiting on? I've checked a bunch of things, but I'm probably overlooking something trivial (which is par for me). I'm using Open MPI 1.6.1 and Slurm 2.4.2 on CentOS 6.3, with InfiniBand/PSM.