On 05.04.2012 at 17:55, Eloi Gaudry wrote:

> 
> >> Here are the allocation info retrieved from `qstat -g t` for the related 
> >> job:
> > 
> > For me the output of `qstat -g t` shows MASTER and SLAVE entries but no 
> > variables. Is there any wrapper defined for `qstat` to reformat the output 
> > (or a ~/.sge_qstat defined)?
> > 
> > [eg: ] sorry, i forgot about sge_qstat being defined. As I don't have any 
> > slot available right now, I cannot relaunch the job to get the output 
> > updated.
> Reuti, here is the output you asked for two days ago.
> It was produced with another "bad" run for which 3 processes are running on 
> nodes charlie and carl... but we should have only 2 processes on carl and 4 
> on charlie...

This is indeed strange, as it first detects the correct allocation, which
conforms to the one granted by SGE.

- Did you use a plain `mpiexec` without any number of processes or machinefile?
- Can you please post, while the job is running, the relevant lines from:

ps -e f --cols=500

(f without the leading -) from both machines.

The processes seem to be distributed across the nodes in a round-robin fashion
rather than according to the granted slots.
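
To make the round-robin observation concrete, here is a minimal Python sketch comparing two placement strategies for the slot counts reported in the debug output (charlie: 4, carl: 2). This is not Open MPI code: the function names `map_by_slot` and `map_by_node` are made up for illustration, and the by-node variant is a simplified round-robin that ignores slot counts, which is only an assumption about what might be happening here.

```python
def map_by_slot(slots, nprocs):
    """Fill each node up to its granted slot count before moving on."""
    placement = {node: [] for node in slots}
    rank = 0
    for node, count in slots.items():
        for _ in range(count):
            if rank == nprocs:
                return placement
            placement[node].append(rank)
            rank += 1
    return placement


def map_by_node(slots, nprocs):
    """Deal ranks to the nodes one at a time (simplified: slot counts are ignored)."""
    nodes = list(slots)
    placement = {node: [] for node in nodes}
    for rank in range(nprocs):
        placement[nodes[rank % len(nodes)]].append(rank)
    return placement


# Slot counts as shown by PE_HOSTFILE in the debug output.
slots = {"charlie": 4, "carl": 2}

by_slot = map_by_slot(slots, 6)
by_node = map_by_node(slots, 6)

print("by slot:", by_slot)
print("by node:", by_node)
```

In this sketch the by-node placement puts ranks 0, 2, 4 on charlie and 1, 3, 5 on carl, which is the same pattern as in the map printed by orterun, while the by-slot placement would give the expected 4 + 2 split.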

-- Reuti


>  
> Output from qstat -g t:
> ------------------------------------
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> smp...@carl.fft                BIP   0/2/4          1.14     lx-amd64
>   hc:mem_available=1.715G
>    1391 0.57643 semi_green jj           r     04/05/2012 15:41:04 SLAVE
>                                                                   SLAVE
> ---------------------------------------------------------------------------------
> smp...@charlie.fft             BIP   0/4/8          1.73     lx-amd64
>   hc:mem_available=4.018G
>    1391 0.57643 semi_green jj           r     04/05/2012 15:41:04 MASTER
>                                                                   SLAVE
>                                                                   SLAVE
>                                                                   SLAVE
>                                                                   SLAVE
>  
> Debug output from orterun:
> ------------------------------------
> [charlie:08194] ras:gridengine: JOB_ID: 1391
> [charlie:08194] ras:gridengine: PE_HOSTFILE: /opt/sge/default/spool/charlie/active_jobs/1391.1/pe_hostfile
> [charlie:08194] ras:gridengine: charlie.fft: PE_HOSTFILE shows slots=4
> [charlie:08194] ras:gridengine: carl.fft: PE_HOSTFILE shows slots=2
> 
> ======================   ALLOCATED NODES   ======================
> 
>  Data for node: Name: charlie   Launch id: -1 Arch: ffc91200  State: 2
>   Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
>   Daemon: [[57575,0],0] Daemon launched: True
>   Num slots: 4  Slots in use: 0
>   Num slots allocated: 4  Max slots: 0
>   Username on node: NULL
>   Num procs: 0  Next node_rank: 0
>  Data for node: Name: carl.fft    Launch id: -1 Arch: 0 State: 2
>   Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
>   Daemon: Not defined Daemon launched: False
>   Num slots: 2  Slots in use: 0
>   Num slots allocated: 2  Max slots: 0
>   Username on node: NULL
>   Num procs: 0  Next node_rank: 0
> 
> =================================================================
> 
>  Map generated by mapping policy: 0200
>   Npernode: 0 Oversubscribe allowed: TRUE CPU Lists: FALSE
>   Num new daemons: 1  New daemon starting vpid 1
>   Num nodes: 2
> 
>  Data for node: Name: charlie   Launch id: -1 Arch: ffc91200  State: 2
>   Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
>   Daemon: [[57575,0],0] Daemon launched: True
>   Num slots: 4  Slots in use: 3
>   Num slots allocated: 4  Max slots: 0
>   Username on node: NULL
>   Num procs: 3  Next node_rank: 3
>   Data for proc: [[57575,1],0]
>     Pid: 0  Local rank: 0 Node rank: 0
>     State: 0  App_context: 0  Slot list: NULL
>   Data for proc: [[57575,1],2]
>     Pid: 0  Local rank: 1 Node rank: 1
>     State: 0  App_context: 0  Slot list: NULL
>   Data for proc: [[57575,1],4]
>     Pid: 0  Local rank: 2 Node rank: 2
>     State: 0  App_context: 0  Slot list: NULL
> 
>  Data for node: Name: carl.fft    Launch id: -1 Arch: 0 State: 2
>   Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
>   Daemon: [[57575,0],1] Daemon launched: False
>   Num slots: 2  Slots in use: 3
>   Num slots allocated: 2  Max slots: 0
>   Username on node: NULL
>   Num procs: 3  Next node_rank: 3
>   Data for proc: [[57575,1],1]
>     Pid: 0  Local rank: 0 Node rank: 0
>     State: 0  App_context: 0  Slot list: NULL
>   Data for proc: [[57575,1],3]
>     Pid: 0  Local rank: 1 Node rank: 1
>     State: 0  App_context: 0  Slot list: NULL
>   Data for proc: [[57575,1],5]
>     Pid: 0  Local rank: 2 Node rank: 2
>     State: 0  App_context: 0  Slot list: NULL
> 
>  
>  
> Regards,
> Eloi
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

