On 05.04.2012 at 17:55, Eloi Gaudry wrote:
>
> >> Here is the allocation info retrieved from `qstat -g t` for the related
> >> job:
> >
> > For me the output of `qstat -g t` shows MASTER and SLAVE entries but no
> > variables. Is there any wrapper defined for `qstat` to reformat the output
> > (or a ~/.sge_qstat defined)?
> >
> > [eg: ] Sorry, I forgot about sge_qstat being defined. As I don't have any
> > slot available right now, I cannot relaunch the job to get the output
> > updated.
>
> Reuti, here is the output you asked for two days ago.
> It was produced with another "bad" run for which 3 processes are running on
> nodes charlie and carl... but we should have only 2 processes on carl and 4
> on charlie...
This is indeed strange, as it first detects the correct allocation, and that allocation conforms to the one granted.

- You used a plain `mpiexec`, without any number of processes or a machinefile?

- While the job is running, can you please post the relevant lines from:

  ps -e f --cols=500

  (f w/o -) from both machines?

The processes are allocated between the nodes more like in a round-robin fashion. (See the cross-check sketch appended after the quoted output below.)

-- Reuti

> Output from qstat -g t:
> ------------------------------------
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> smp...@carl.fft                BIP   0/2/4          1.14     lx-amd64
>         hc:mem_available=1.715G
>    1391 0.57643 semi_green jj          r     04/05/2012 15:41:04     SLAVE
>                                                                      SLAVE
> ---------------------------------------------------------------------------------
> smp...@charlie.fft             BIP   0/4/8          1.73     lx-amd64
>         hc:mem_available=4.018G
>    1391 0.57643 semi_green jj          r     04/05/2012 15:41:04     MASTER
>                                                                      SLAVE
>                                                                      SLAVE
>                                                                      SLAVE
>                                                                      SLAVE
>
> Debug output from orterun:
> ------------------------------------
> [charlie:08194] ras:gridengine: JOB_ID: 1391
> [charlie:08194] ras:gridengine: PE_HOSTFILE: /opt/sge/default/spool/charlie/active_jobs/1391.1/pe_hostfile
> [charlie:08194] ras:gridengine: charlie.fft: PE_HOSTFILE shows slots=4
> [charlie:08194] ras:gridengine: carl.fft: PE_HOSTFILE shows slots=2
>
> ======================   ALLOCATED NODES   ======================
>
>  Data for node: Name: charlie   Launch id: -1   Arch: ffc91200   State: 2
>        Num boards: 1   Num sockets/board: 2   Num cores/socket: 4
>        Daemon: [[57575,0],0]   Daemon launched: True
>        Num slots: 4   Slots in use: 0
>        Num slots allocated: 4   Max slots: 0
>        Username on node: NULL
>        Num procs: 0   Next node_rank: 0
>  Data for node: Name: carl.fft   Launch id: -1   Arch: 0   State: 2
>        Num boards: 1   Num sockets/board: 2   Num cores/socket: 4
>        Daemon: Not defined   Daemon launched: False
>        Num slots: 2   Slots in use: 0
>        Num slots allocated: 2   Max slots: 0
>        Username on node: NULL
>        Num procs: 0   Next node_rank: 0
>
> =================================================================
>
>  Map generated by mapping policy: 0200
>        Npernode: 0   Oversubscribe allowed: TRUE   CPU Lists: FALSE
>        Num new daemons: 1   New daemon starting vpid 1
>        Num nodes: 2
>
>  Data for node: Name: charlie   Launch id: -1   Arch: ffc91200   State: 2
>        Num boards: 1   Num sockets/board: 2   Num cores/socket: 4
>        Daemon: [[57575,0],0]   Daemon launched: True
>        Num slots: 4   Slots in use: 3
>        Num slots allocated: 4   Max slots: 0
>        Username on node: NULL
>        Num procs: 3   Next node_rank: 3
>        Data for proc: [[57575,1],0]
>                Pid: 0   Local rank: 0   Node rank: 0
>                State: 0   App_context: 0   Slot list: NULL
>        Data for proc: [[57575,1],2]
>                Pid: 0   Local rank: 1   Node rank: 1
>                State: 0   App_context: 0   Slot list: NULL
>        Data for proc: [[57575,1],4]
>                Pid: 0   Local rank: 2   Node rank: 2
>                State: 0   App_context: 0   Slot list: NULL
>
>  Data for node: Name: carl.fft   Launch id: -1   Arch: 0   State: 2
>        Num boards: 1   Num sockets/board: 2   Num cores/socket: 4
>        Daemon: [[57575,0],1]   Daemon launched: False
>        Num slots: 2   Slots in use: 3
>        Num slots allocated: 2   Max slots: 0
>        Username on node: NULL
>        Num procs: 3   Next node_rank: 3
>        Data for proc: [[57575,1],1]
>                Pid: 0   Local rank: 0   Node rank: 0
>                State: 0   App_context: 0   Slot list: NULL
>        Data for proc: [[57575,1],3]
>                Pid: 0   Local rank: 1   Node rank: 1
>                State: 0   App_context: 0   Slot list: NULL
>        Data for proc: [[57575,1],5]
>                Pid: 0   Local rank: 2   Node rank: 2
>                State: 0   App_context: 0   Slot list: NULL
>
>
> Regards,
> Eloi
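PS: A rough sketch of the cross-check I have in mind, run from inside the job script while the job is active. I'm assuming a tight SGE integration (so $PE_HOSTFILE is set in the job environment), passwordless ssh between the nodes, and "solver.bin" is only a placeholder for your actual binary name:

    #!/bin/sh
    # Compare what SGE granted with what is actually running.
    # Each pe_hostfile line reads: hostname  slots  queue  processor-range
    echo "=== slots granted per host ==="
    awk '{print $1, $2}' "$PE_HOSTFILE"

    echo "=== application processes per host ==="
    for h in $(awk '{print $1}' "$PE_HOSTFILE"); do
        printf '%s: ' "$h"
        # [s]olver.bin is a placeholder for your binary; the bracket trick
        # keeps grep from counting itself in the remote process list
        ssh "$h" "ps -e f --cols=500 | grep -c '[s]olver.bin'"
    done

If carl shows 3 processes here although the pe_hostfile grants it only 2 slots, then the allocation handed over by SGE is fine and the oversubscription happens on the Open MPI mapping side.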