Something doesn't make sense here. If you direct-launch with srun, there is no orted involved; the orted only gets launched if you start the job with mpirun.
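As a quick sanity check - the flags and counts below are only illustrative, not taken from your job - compare what actually lands on a compute node under the two launch modes:

  # direct launch: slurmstepd starts the MPI procs itself - no orted anywhere
  srun -n 1024 --ntasks-per-node 12 ./xhpl
  pgrep -c orted          # on a compute node: expect 0

  # mpirun launch: one orted per node, which then fork/execs the app procs
  mpirun -np 1024 ./xhpl
  pgrep -c orted          # on a compute node: expect 1

If you are launching with srun and still see orteds on the nodes, then something in the path (a wrapper script, a leftover job step, etc.) is really going through mpirun.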
Did you configure --with-pmi and point it at wherever that include file resides? Otherwise the procs will all think they are singletons. (A rough configure sketch is appended below the quoted thread.)

Sent from my iPhone

On Oct 12, 2012, at 7:27 AM, Michael Di Domenico <mdidomeni...@gmail.com> wrote:

> what isn't working is when i fire off an MPI job with over 800 ranks,
> they don't all actually start up a process
>
> e.g., if i do srun -n 1024 --ntasks-per-node 12 xhpl
>
> and then do a 'pgrep xhpl | wc -l' on all of the allocated nodes, not
> all of them have actually started xhpl
>
> most will read 12 started processes, but an inconsistent list of nodes
> will fail to actually start xhpl and stall the whole job
>
> if i look at all the nodes allocated to my job, it does start the orte
> process though
>
> what i need to figure out is why the orte process starts but fails to
> actually start xhpl on some of the nodes
>
> unfortunately, the list of nodes that don't start xhpl during my runs
> changes each time and no hardware errors are being detected. if i
> cancel the job and restart the job over and over, eventually one will
> actually kick off and run to completion.
>
> if i run the process outside of slurm just using openmpi, it seems to
> behave correctly, so i'm leaning towards a problem with slurm
> interacting with openmpi.
>
> what i'd like to do is instrument a debug in openmpi that will tell me
> what openmpi is waiting on in order to kick off the xhpl binary
>
> i'm testing to see whether it's a psm-related problem now, i'll check
> back if i can narrow the scope a little more
>
> On Thu, Oct 11, 2012 at 10:21 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> I'm afraid I'm confused - I don't understand what is and isn't working.
>> What "next process" isn't starting?
>>
>> On Thu, Oct 11, 2012 at 9:41 AM, Michael Di Domenico
>> <mdidomeni...@gmail.com> wrote:
>>>
>>> adding some additional info
>>>
>>> i did an strace on an orted process where xhpl failed to start; i did
>>> this after the mpirun execution, so i probably missed some output, but
>>> it keeps scrolling
>>>
>>> poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN},
>>>       {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN},
>>>       {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}],
>>>       9, 1000) = 0 (Timeout)
>>>
>>> i didn't see anything useful in /proc under those file descriptors,
>>> but perhaps i missed something i don't know to look for
>>>
>>> On Thu, Oct 11, 2012 at 12:06 PM, Michael Di Domenico
>>> <mdidomeni...@gmail.com> wrote:
>>>> to add a little more detail, it looks like xhpl is not actually
>>>> starting on all nodes when i kick off the mpirun
>>>>
>>>> each time i cancel and restart the job, the nodes that do not start
>>>> change, so i can't call it a bad node
>>>>
>>>> if i disable infiniband with --mca btl self,sm,tcp, on occasion i can
>>>> get xhpl to actually run, but it's not consistent
>>>>
>>>> i'm going to check my ethernet network and make sure there are no
>>>> problems there (could this be an OOB error with mpirun?). on the nodes
>>>> that fail to start xhpl, i do see the orte process, but there is
>>>> nothing in the logs about why it failed to launch xhpl
>>>>
>>>> On Thu, Oct 11, 2012 at 11:49 AM, Michael Di Domenico
>>>> <mdidomeni...@gmail.com> wrote:
>>>>> I'm trying to diagnose an MPI job (in this case xhpl) that fails to
>>>>> start when the rank count gets fairly high into the thousands.
>>>>>
>>>>> My symptom is the job fires up via slurm, and I can see all the xhpl
>>>>> processes on the nodes, but it never kicks over to the next process.
>>>>>
>>>>> My question is, what debugs should I turn on to tell me what the
>>>>> system might be waiting on?
>>>>>
>>>>> I've checked a bunch of things, but I'm probably overlooking something
>>>>> trivial (which is par for me).
>>>>>
>>>>> I'm using Open MPI 1.6.1 and Slurm 2.4.2 on CentOS 6.3, with
>>>>> Infiniband/PSM.
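Back to the --with-pmi question above: here is roughly what a PMI-enabled build looks like. This is only a sketch - the install prefix and the /usr paths below are just the common Slurm package defaults and are assumptions on my part, so point --with-pmi at wherever your Slurm installation actually put pmi.h and libpmi:

  # check where slurm installed its PMI header and library (typical defaults shown)
  ls /usr/include/slurm/pmi.h /usr/lib64/libpmi*

  # rebuild Open MPI with Slurm + PMI support, then reinstall on the nodes
  ./configure --prefix=/opt/openmpi-1.6.1 --with-slurm --with-pmi=/usr
  make -j8 all && make install

  # a pmi component should now show up in the build
  ompi_info | grep -i pmi

  # and the direct launch stays as you have it
  srun -n 1024 --ntasks-per-node 12 ./xhpl

Without that, each srun-launched proc comes up thinking it is a singleton, as noted above.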
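On the earlier question of which debugs to turn on: these are the verbosity knobs I would start with. The level values are arbitrary (just "noisy enough"), and the exact output varies by version, so treat this as a starting point rather than a recipe:

  # mpirun-style launch: keep the orteds attached and have the launch and
  # fork/exec frameworks report what they are doing
  mpirun -np 1024 --debug-daemons \
      --mca plm_base_verbose 5 \
      --mca odls_base_verbose 5 \
      ./xhpl 2>&1 | tee launch.log

  # any MCA parameter can also be fed in through the environment, which is
  # handy for srun direct launches where there is no mpirun command line
  export OMPI_MCA_ess_base_verbose=5
  srun -n 1024 --ntasks-per-node 12 ./xhpl

That should at least show whether the daemons ever get the launch command for the missing xhpl procs.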