>> Here is the allocation info retrieved from `qstat -g t` for the related job:
> 
> For me the output of `qstat -g t` shows MASTER and SLAVE entries but no 
> variables. Is there any wrapper defined for `qstat` to reformat the output 
> (or a ~/.sge_qstat defined)?
> 
> [eg:] Sorry, I forgot that sge_qstat was defined. As I don't have any 
> slots available right now, I cannot relaunch the job to get updated output.

Reuti, here is the output you asked for two days ago.

It was produced by another "bad" run in which 3 processes are running on each 
of the nodes charlie and carl, whereas we should have only 2 processes on carl 
and 4 on charlie.
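
To make the expected and observed placements concrete, here is a small toy model in Python. It is only a sketch, not Open MPI's mapper code: the function names are made up for illustration, and the by-node variant is assumed to ignore slot counts entirely. With 4 slots on charlie and 2 on carl, filling by slot gives the expected 4 + 2 split, while dealing ranks out one node at a time gives the 3 + 3 split with the same rank numbering that the orterun map reports.

```python
def map_by_slot(slots, nprocs):
    """Fill each node up to its slot count before moving to the next node."""
    placement = {host: [] for host in slots}
    rank = 0
    for host, count in slots.items():
        for _ in range(count):
            if rank == nprocs:
                return placement
            placement[host].append(rank)
            rank += 1
    return placement


def map_by_node(slots, nprocs):
    """Deal ranks out one per node in turn (assumption: slot counts ignored)."""
    hosts = list(slots)
    placement = {host: [] for host in hosts}
    for rank in range(nprocs):
        placement[hosts[rank % len(hosts)]].append(rank)
    return placement


# Slot counts as reported by PE_HOSTFILE for job 1391.
slots = {"charlie": 4, "carl": 2}

print("by slot:", map_by_slot(slots, 6))
print("by node:", map_by_node(slots, 6))
```

The by-node result places ranks 0, 2, 4 on charlie and 1, 3, 5 on carl, which is consistent with the map in the debug output, although that alone does not prove which policy was actually applied.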

 
Output from qstat -g t:

------------------------------------

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
smp...@carl.fft                BIP   0/2/4          1.14     lx-amd64
  hc:mem_available=1.715G
   1391 0.57643 semi_green jj           r     04/05/2012 15:41:04 SLAVE
                                                                  SLAVE
---------------------------------------------------------------------------------
smp...@charlie.fft             BIP   0/4/8          1.73     lx-amd64
  hc:mem_available=4.018G
   1391 0.57643 semi_green jj           r     04/05/2012 15:41:04 MASTER
                                                                  SLAVE
                                                                  SLAVE
                                                                  SLAVE
                                                                  SLAVE

 
Debug output from orterun:

------------------------------------
[charlie:08194] ras:gridengine: JOB_ID: 1391
[charlie:08194] ras:gridengine: PE_HOSTFILE: /opt/sge/default/spool/charlie/active_jobs/1391.1/pe_hostfile
[charlie:08194] ras:gridengine: charlie.fft: PE_HOSTFILE shows slots=4
[charlie:08194] ras:gridengine: carl.fft: PE_HOSTFILE shows slots=2

======================   ALLOCATED NODES   ======================

 Data for node: Name: charlie   Launch id: -1 Arch: ffc91200  State: 2
  Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
  Daemon: [[57575,0],0] Daemon launched: True
  Num slots: 4  Slots in use: 0
  Num slots allocated: 4  Max slots: 0
  Username on node: NULL
  Num procs: 0  Next node_rank: 0
 Data for node: Name: carl.fft    Launch id: -1 Arch: 0 State: 2
  Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
  Daemon: Not defined Daemon launched: False
  Num slots: 2  Slots in use: 0
  Num slots allocated: 2  Max slots: 0
  Username on node: NULL
  Num procs: 0  Next node_rank: 0

=================================================================

 Map generated by mapping policy: 0200
  Npernode: 0 Oversubscribe allowed: TRUE CPU Lists: FALSE
  Num new daemons: 1  New daemon starting vpid 1
  Num nodes: 2

 Data for node: Name: charlie   Launch id: -1 Arch: ffc91200  State: 2
  Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
  Daemon: [[57575,0],0] Daemon launched: True
  Num slots: 4  Slots in use: 3
  Num slots allocated: 4  Max slots: 0
  Username on node: NULL
  Num procs: 3  Next node_rank: 3
  Data for proc: [[57575,1],0]
    Pid: 0  Local rank: 0 Node rank: 0
    State: 0  App_context: 0  Slot list: NULL
  Data for proc: [[57575,1],2]
    Pid: 0  Local rank: 1 Node rank: 1
    State: 0  App_context: 0  Slot list: NULL
  Data for proc: [[57575,1],4]
    Pid: 0  Local rank: 2 Node rank: 2
    State: 0  App_context: 0  Slot list: NULL

 Data for node: Name: carl.fft    Launch id: -1 Arch: 0 State: 2
  Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
  Daemon: [[57575,0],1] Daemon launched: False
  Num slots: 2  Slots in use: 3
  Num slots allocated: 2  Max slots: 0
  Username on node: NULL
  Num procs: 3  Next node_rank: 3
  Data for proc: [[57575,1],1]
    Pid: 0  Local rank: 0 Node rank: 0
    State: 0  App_context: 0  Slot list: NULL
  Data for proc: [[57575,1],3]
    Pid: 0  Local rank: 1 Node rank: 1
    State: 0  App_context: 0  Slot list: NULL
  Data for proc: [[57575,1],5]
    Pid: 0  Local rank: 2 Node rank: 2
    State: 0  App_context: 0  Slot list: NULL
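
For longer dumps, the mismatch between "Num slots" and "Slots in use" can be picked out mechanically. The following Python sketch is an assumption-laden helper (the function name and the regular expressions are illustrative and only match the line layout shown above); it scans a node map like the one printed by orterun and reports nodes where more slots are in use than were allocated.

```python
import re


def oversubscribed_nodes(dump):
    """Return {node: (slots, in_use)} for nodes where in_use exceeds slots."""
    result = {}
    node = None
    for line in dump.splitlines():
        m = re.search(r"Data for node: Name: (\S+)", line)
        if m:
            node = m.group(1)
            continue
        # "Num slots allocated: N" does not match, since the colon must
        # directly follow "Num slots".
        m = re.search(r"Num slots: (\d+)\s+Slots in use: (\d+)", line)
        if m and node is not None:
            slots, in_use = int(m.group(1)), int(m.group(2))
            if in_use > slots:
                result[node] = (slots, in_use)
    return result


# Excerpt of the map section above, trimmed to the relevant lines.
sample = """\
 Data for node: Name: charlie   Launch id: -1 Arch: ffc91200  State: 2
  Num slots: 4  Slots in use: 3
  Num slots allocated: 4  Max slots: 0
 Data for node: Name: carl.fft    Launch id: -1 Arch: 0 State: 2
  Num slots: 2  Slots in use: 3
  Num slots allocated: 2  Max slots: 0
"""

print(oversubscribed_nodes(sample))
```

On this excerpt only carl.fft is flagged, with 3 slots in use against 2 allocated.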

 
 
Regards,

Eloi
