I'm afraid I'm confused - I don't understand what is and isn't working. What "next process" isn't starting?
On Thu, Oct 11, 2012 at 9:41 AM, Michael Di Domenico <mdidomeni...@gmail.com> wrote:
> adding some additional info
>
> did an strace on an orted process where xhpl failed to start. i did
> this after the mpirun execution, so i probably missed some output, but
> it keeps scrolling:
>
> poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN},
>       {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN},
>       {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}],
>      9, 1000) = 0 (Timeout)
>
> i didn't see anything useful in /proc under those file descriptors,
> but perhaps i missed something i don't know to look for
>
> On Thu, Oct 11, 2012 at 12:06 PM, Michael Di Domenico <mdidomeni...@gmail.com> wrote:
> > to add a little more detail, it looks like xhpl is not actually
> > starting on all nodes when i kick off the mpirun
> >
> > each time i cancel and restart the job, the nodes that do not start
> > change, so i can't call it a bad node
> >
> > if i disable infiniband with --mca btl self,sm,tcp, on occasion i can
> > get xhpl to actually run, but it's not consistent
> >
> > i'm going to check my ethernet network and make sure there are no
> > problems there (could this be an OOB error with mpirun?). on the nodes
> > that fail to start xhpl, i do see the orted process, but nothing in
> > the logs about why it failed to launch xhpl
> >
> > On Thu, Oct 11, 2012 at 11:49 AM, Michael Di Domenico <mdidomeni...@gmail.com> wrote:
> >> I'm trying to diagnose an MPI job (in this case xhpl) that fails to
> >> start when the rank count gets fairly high, into the thousands.
> >>
> >> My symptom is that the job fires up via slurm, and I can see all the
> >> xhpl processes on the nodes, but it never kicks over to the next
> >> process.
> >>
> >> My question is: what debugs should I turn on to tell me what the
> >> system might be waiting on?
> >>
> >> I've checked a bunch of things, but I'm probably overlooking
> >> something trivial (which is par for me).
> >>
> >> I'm using Open MPI 1.6.1 and Slurm 2.4.2 on CentOS 6.3, with
> >> Infiniband/PSM.
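For the "what debugs should I turn on" question, the usual starting point with this generation of Open MPI is the MCA verbosity parameters on the launch side. A minimal sketch, assuming Open MPI 1.6's plm/odls frameworks behave as in other 1.x releases and that each node runs a single orted for the job (the rank count and binary path below are placeholders):

  # mpirun side: verbose output from the launcher (plm) and from the
  # framework that starts local processes on each node (odls);
  # orte_debug_daemons keeps each orted's output attached to mpirun,
  # so per-node launch failures become visible instead of silent.
  mpirun --mca plm_base_verbose 5 \
         --mca odls_base_verbose 5 \
         --mca orte_debug_daemons 1 \
         -np 2048 ./xhpl

  # on a node where xhpl never appeared: list what the polling orted
  # has open, so the fds in the strace output can be matched against
  # sockets/pipes (assumes one orted on the node).
  ls -l /proc/$(pgrep -n orted)/fd

If the out-of-band TCP channel over Ethernet is the suspect, adding --mca oob_base_verbose 10 to the same command line should expose the OOB traffic as well.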