Something doesn't make sense here. If you direct launch with srun, there is no
orted involved. The orted only gets launched if you start with mpirun.
Did you configure --with-pmi and point to where that include file resides?
Otherwise, the procs will all think they are singletons.
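Roughly speaking, that means something like the following at build time, where the --with-pmi path is wherever your Slurm installation put pmi.h/pmi2.h (the /usr prefix is only an example), and then launching directly under the allocation:

    ./configure --with-slurm --with-pmi=/usr
    make install

    # direct launch under Slurm: no mpirun, hence no orted
    srun -n 1024 --ntasks-per-node 12 ./xhpl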
Hi
I don't use Slurm, and our clusters are fairly small (few tens of nodes,
few hundred cores).
Having said that, I know that Torque, which we use here,
requires specific system configuration changes on large clusters,
like increasing the maximum number of open files and the ARP cache size.
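For what it's worth, the kind of tuning I mean looks roughly like this on a Linux cluster (the values are purely illustrative, not recommendations):

    # /etc/security/limits.conf: raise the per-process open-file limit
    *   soft   nofile   65536
    *   hard   nofile   65536

    # sysctl: enlarge the kernel neighbour (ARP) cache thresholds
    net.ipv4.neigh.default.gc_thresh1 = 8192
    net.ipv4.neigh.default.gc_thresh2 = 16384
    net.ipv4.neigh.default.gc_thresh3 = 32768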
I turned on the daemon debugs for orted and noticed this difference.
I get this on all the good nodes (the ones that actually started xhpl):
Daemon was launched on node08 - beginning to initialize
[node08:21230] [[64354,0],1] orted_cmd: received add_local_procs
[node08:21230] [[64354,0],0] orted_rec
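(For reference, daemon debug output like the above comes from running with daemon debugging enabled, along the lines of:)

    # daemon debug messages are printed on stderr
    mpirun --debug-daemons -np 1024 ./xhpl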
What isn't working: when I fire off an MPI job with over 800 ranks,
they don't all actually start up a process.
For example, if I do srun -n 1024 --ntasks-per-node 12 xhpl
and then do a 'pgrep xhpl | wc -l' on all of the allocated nodes, not
all of them have actually started xhpl; most will read 12, but some
come up short.
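(The per-node check is just a loop of this sort, run from inside the allocation while the job is up; it assumes ssh to the compute nodes is allowed:)

    # count xhpl processes on every allocated node
    for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
        printf '%s: ' "$host"
        ssh "$host" pgrep -c xhpl    # -c prints the match count
    done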
I'm afraid I'm confused - I don't understand what is and isn't working.
What "next process" isn't starting?
On Thu, Oct 11, 2012 at 9:41 AM, Michael Di Domenico wrote:
Adding some additional info.
I did an strace on an orted process where xhpl failed to start; I did
this after the mpirun execution, so I probably missed some output, but
it keeps scrolling:
poll([{fd=4, events=POLLIN},{fd=7, events=POLLIN},{fd=8,
events=POLLIN},{fd=10, events=POLLIN},{fd=12, events=P
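(The attach itself was something along these lines; the -e filter just limits the output to the calls of interest, and it assumes only one orted is running on the node:)

    strace -f -tt -p $(pgrep -f orted) -e trace=poll,read,write,connect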
To add a little more detail: it looks like xhpl is not actually
starting on all nodes when I kick off the mpirun.
Each time I cancel and restart the job, the nodes that do not start
change, so I can't call it a bad node.
If I disable InfiniBand with --mca btl self,sm,tcp, on occasion I can
get xhpl to start on every node.
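(i.e. something like the following, which keeps traffic on the shared-memory and TCP BTLs instead of openib:)

    mpirun --mca btl self,sm,tcp -np 1024 ./xhpl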
I'm trying to diagnose an MPI job (in this case xhpl) that fails to
start when the rank count gets fairly high, into the thousands.
My symptom is that the job fires up via Slurm, and I can see all the xhpl
processes on the nodes, but it never kicks over to the next process.
My question is: what debug options should I turn on to track down where
it's getting stuck?
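To be concrete, I'm wondering whether the usual mpirun verbosity knobs are the right place to start, e.g. something like:

    mpirun --display-allocation --display-map \
           --mca plm_base_verbose 5 \
           --mca odls_base_verbose 5 \
           -np 2048 ./xhpl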