I'm afraid I'm confused - I don't understand what is and isn't working.
What "next process" isn't starting?


On Thu, Oct 11, 2012 at 9:41 AM, Michael Di Domenico <mdidomeni...@gmail.com> wrote:

> adding some additional info
>
> did an strace on an orted process where xhpl failed to start. i did
> this after the mpirun execution, so i probably missed some output,
> but this poll call keeps scrolling by:
>
> poll([{fd=4, events=POLLIN},{fd=7, events=POLLIN},{fd=8,
> events=POLLIN},{fd=10, events=POLLIN},{fd=12, events=POLLIN},{fd=13,
> events=POLLIN},{fd=14, events=POLLIN},{fd=15, events=POLLIN},{fd=16,
> events=POLLIN}], 9, 1000) = 0 (Timeout)
>
> i didn't see anything useful in /proc under those file descriptors,
> but perhaps i missed something i didn't know to look for
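
That poll() with the 1000ms timeout is most likely just the daemon's
event loop ticking over, so on its own it isn't an error. To see what
those descriptors actually point at, you can list them directly
(substituting the real orted pid):

    ls -l /proc/<orted pid>/fd
    lsof -p <orted pid>

lsof will resolve any sockets among them to their remote endpoints,
which tells you who the daemon is (or isn't) talking to.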
>
> On Thu, Oct 11, 2012 at 12:06 PM, Michael Di Domenico
> <mdidomeni...@gmail.com> wrote:
> > to add a little more detail, it looks like xhpl is not actually
> > starting on all nodes when i kick off the mpirun
> >
> > each time i cancel and restart the job, the set of nodes that fail
> > to start changes, so i can't pin it on a bad node
> >
> > if i disable infiniband with --mca btl self,sm,tcp, i can on
> > occasion get xhpl to actually run, but it's not consistent
> >
> > i'm going to check my ethernet network and make sure there are no
> > problems there (could this be an OOB error with mpirun?). on the
> > nodes that fail to start xhpl, i do see the orted process, but
> > there's nothing in the logs about why it failed to launch xhpl
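
If you suspect the OOB, one quick test is to pin it to a known-good
interface, e.g.:

    mpirun --mca oob_tcp_if_include eth0 ...

(eth0 here is just a placeholder for whichever ethernet device your
nodes actually share; the analogous btl_tcp_if_include does the same
for the TCP BTL.)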
> >
> >
> >
> > On Thu, Oct 11, 2012 at 11:49 AM, Michael Di Domenico
> > <mdidomeni...@gmail.com> wrote:
> >> I'm trying to diagnose an MPI job (in this case xhpl) that fails to
> >> start when the rank count gets fairly high, into the thousands.
> >>
> >> My symptom is that the job fires up via slurm, and I can see all the
> >> xhpl processes on the nodes, but it never kicks over to the next process.
> >>
> >> My question is, what debug options should I turn on to tell me what
> >> the system might be waiting on?
> >>
> >> I've checked a bunch of things, but I'm probably overlooking something
> >> trivial (which is par for me).
> >>
> >> I'm using Open MPI 1.6.1 and Slurm 2.4.2 on CentOS 6.3, with
> >> InfiniBand/PSM
