Something doesn't make sense here. If you direct launch with srun, there is no 
orted involved. The orted only gets launched if you start with mpirun.
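
One way to see which launcher is really in play is to count orted and xhpl processes on each allocated node. This is only a rough sketch: it assumes pgrep is available on the compute nodes and that you run it from inside the job's allocation. The helper name flag_nodes and the "host orted_count xhpl_count" line layout are made up for illustration.

```shell
# Hypothetical helper: reads lines of "host orted_count xhpl_count" on stdin
# and prints every host that is running an orted or has fewer xhpl
# processes than expected.
flag_nodes() {
    expected="$1"
    awk -v want="$expected" '
        $2 > 0    { print $1 ": orted present (" $2 ")" }
        $3 < want { print $1 ": only " $3 " of " want " xhpl procs" }
    '
}

# Possible usage inside an allocation, one task per node (not run here):
# srun --ntasks-per-node=1 sh -c \
#   'echo "$(hostname) $(pgrep -c -x orted) $(pgrep -c -x xhpl)"' | flag_nodes 12
```

A node that legitimately holds fewer than 12 ranks (1024 does not divide evenly by 12) will also be flagged, so read the output with that in mind.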

Did you configure --with-pmi and point it to where the PMI include file 
(pmi.h) resides? Otherwise, the procs will all think they are singletons.
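
A quick check of the installed build, sketched under the assumption that a PMI-enabled install lists "pmi" among its MCA components in ompi_info output (verify on your own build). The helper name has_pmi and the assumed line shape are illustrative only.

```shell
# Hypothetical helper: succeeds if the ompi_info output on stdin contains
# an MCA component named "pmi" (assumed line shape: "MCA <framework>: pmi ...").
has_pmi() {
    grep -Eq 'MCA +[a-z_]+: +pmi( |$)'
}

# Possible usage (not run here):
# ompi_info | has_pmi && echo "PMI components found" || echo "no PMI components found"
#
# If none are found, re-run configure with --with-pmi=<dir>, where <dir> is a
# placeholder for wherever the PMI headers live on your system, and rebuild.
```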

On Oct 12, 2012, at 7:27 AM, Michael Di Domenico <mdidomeni...@gmail.com> wrote:

> what isn't working is when i fire off an MPI job with over 800 ranks,
> they don't all actually start up a process
> 
> e.g., if i do srun -n 1024 --ntasks-per-node 12 xhpl
> 
> and then do a 'pgrep xhpl | wc -l', on all of the allocated nodes, not
> all of them have actually started xhpl
> 
> most will read 12 started processes, but an inconsistent list of nodes
> will fail to actually start xhpl and stall the whole job
> 
> if i look at all the nodes allocated to my job, it does start the orte
> process though
> 
> what i need to figure out, is why the orte process starts, but fails
> to actually start xhpl on some of the nodes
> 
> unfortunately, the list of nodes that don't start xhpl during my runs
> changes each time and no hardware errors are being detected.  if i
> cancel the job and restart the job over and over, eventually one will
> actually kick off and run to completion.
> 
> if i run the process outside of slurm just using openmpi, it seems to
> behave correctly, so i'm leaning towards a slurm interacting with
> openmpi problem.
> 
> what i'd like to do is instrument a debug in openmpi that will tell me
> what openmpi is waiting on in order to kick off the xhpl binary
> 
> i'm testing to see whether it's a psm related problem now, i'll check
> back if i can narrow the scope a little more
> 
> On Thu, Oct 11, 2012 at 10:21 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> I'm afraid I'm confused - I don't understand what is and isn't working. What
>> "next process" isn't starting?
>> 
>> 
>> On Thu, Oct 11, 2012 at 9:41 AM, Michael Di Domenico
>> <mdidomeni...@gmail.com> wrote:
>>> 
>>> adding some additional info
>>> 
>>> did an strace on an orted process where xhpl failed to start, i did
>>> this after the mpirun execution, so i probably missed some output, but
>>> it keeps scrolling
>>> 
>>> poll([{fd=4, events=POLLIN},{fd=7, events=POLLIN},{fd=8,
>>> events=POLLIN},{fd=10, events=POLLIN},{fd=12, events=POLLIN},{fd=13,
>>> events=POLLIN},{fd=14, events=POLLIN},{fd=15, events=POLLIN},{fd=16,
>>> events=POLLIN}], 9, 1000) = 0 (Timeout)
>>> 
>>> i didn't see anything useful in /proc under those file descriptors,
>>> but perhaps i missed something i don't know to look for
>>> 
>>> On Thu, Oct 11, 2012 at 12:06 PM, Michael Di Domenico
>>> <mdidomeni...@gmail.com> wrote:
>>>> to add a little more detail, it looks like xhpl is not actually
>>>> starting on all nodes when i kick off the mpirun
>>>> 
>>>> each time i cancel and restart the job, the nodes that do not start
>>>> change, so i can't call it a bad node
>>>> 
>>>> if i disable infiniband with --mca btl self,sm,tcp on occasion i can
>>>> get xhpl to actually run, but it's not consistent
>>>> 
>>>> i'm going to check my ethernet network and make sure there's no
>>>> problems there (could this be an OOB error with mpirun?), on the nodes
>>>> that fail to start xhpl, i do see the orte process, but nothing in the
>>>> logs about why it failed to launch xhpl
>>>> 
>>>> 
>>>> 
>>>> On Thu, Oct 11, 2012 at 11:49 AM, Michael Di Domenico
>>>> <mdidomeni...@gmail.com> wrote:
>>>>> I'm trying to diagnose an MPI job (in this case xhpl), that fails to
>>>>> start when the rank count gets fairly high into the thousands.
>>>>> 
>>>>> My symptom is the job fires up via slurm, and I can see all the xhpl
>>>>> processes on the nodes, but it never kicks over to the next process.
>>>>> 
>>>>> My question is, what debugs should I turn on to tell me what the
>>>>> system might be waiting on?
>>>>> 
>>>>> I've checked a bunch of things, but I'm probably overlooking something
>>>>> trivial (which is par for me).
>>>>> 
>>>>> I'm using Open MPI 1.6.1, Slurm 2.4.2 on CentOS 6.3, with
>>>>> Infiniband/PSM
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> 
