Re: [OMPI users] debugs for jobs not starting

2012-10-12 Thread Ralph Castain
Something doesn't make sense here. If you direct launch with srun, there is no orted involved. The orted only gets launched if you start with mpirun. Did you configure --with-pmi and point to where that include file resides? Otherwise, the procs will all think they are singletons.

Re: [OMPI users] debugs for jobs not starting

2012-10-12 Thread Gus Correa
Hi, I don't use Slurm, and our clusters are fairly small (a few tens of nodes, a few hundred cores). Having said that, I know that Torque, which we use here, requires specific system configuration changes on large clusters, like increasing the maximum number of open files, increasing the ARP cache siz

Re: [OMPI users] debugs for jobs not starting

2012-10-12 Thread Michael Di Domenico
Turned on the daemon debugs for orted and noticed this difference. I get this on all the good nodes (ones that actually started xhpl): Daemon was launched on node08 - beginning to initialize [node08:21230] [[64354,0],1] orted_cmd: received add_local_procs [node08:21230] [[64354,0],0] orted_rec

Re: [OMPI users] debugs for jobs not starting

2012-10-12 Thread Michael Di Domenico
What isn't working is that when I fire off an MPI job with over 800 ranks, they don't all actually start up a process. For example, if I do srun -n 1024 --ntasks-per-node 12 xhpl and then do a 'pgrep xhpl | wc -l' on all of the allocated nodes, not all of them have actually started xhpl. Most will read 12 star

Re: [OMPI users] debugs for jobs not starting

2012-10-11 Thread Ralph Castain
I'm afraid I'm confused - I don't understand what is and isn't working. What "next process" isn't starting? On Thu, Oct 11, 2012 at 9:41 AM, Michael Di Domenico wrote: > adding some additional info > > did an strace on an orted process where xhpl failed to start, i did > this after the mpirun e

Re: [OMPI users] debugs for jobs not starting

2012-10-11 Thread Michael Di Domenico
Adding some additional info. Did an strace on an orted process where xhpl failed to start. I did this after the mpirun execution, so I probably missed some output, but it keeps scrolling: poll([{fd=4, events=POLLIN},{fd=7, events=POLLIN},{fd=8, events=POLLIN},{fd=10, events=POLLIN},{fd=12, events=P

Re: [OMPI users] debugs for jobs not starting

2012-10-11 Thread Michael Di Domenico
To add a little more detail, it looks like xhpl is not actually starting on all nodes when I kick off the mpirun. Each time I cancel and restart the job, the nodes that do not start change, so I can't call it a bad node. If I disable InfiniBand with --mca btl self,sm,tcp, on occasion I can get xhpl

[OMPI users] debugs for jobs not starting

2012-10-11 Thread Michael Di Domenico
I'm trying to diagnose an MPI job (in this case xhpl) that fails to start when the rank count gets fairly high, into the thousands. My symptom is that the job fires up via Slurm, and I can see all the xhpl processes on the nodes, but it never kicks over to the next process. My question is, what debug