Hi,

 
I've observed strange behavior during rank allocation in a distributed run scheduled 
and submitted with SGE (Son of Grid Engine 8.0.0d) and Open MPI 1.4.4.

Briefly, there is a one-slot difference between the allocation reported by SGE and the 
map used by Open MPI: SGE grants four slots (barney: 1, carl.fft: 1, charlie.fft: 2), 
but Open MPI places two ranks on barney and only one on charlie.fft. As a result, 
barney becomes oversubscribed at runtime while one of charlie.fft's slots stays unused.
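
For context, the job is submitted with qsub through an SGE parallel environment and 
mpirun is left to pick up the host list from SGE itself (tight integration), so no 
hostfile is passed. The job script looks roughly like this; the PE name, slot request 
and binary are placeholders, not the exact script:

    #!/bin/bash
    #$ -N ompi_test
    #$ -cwd
    # PE name and slot count below are placeholders (four slots match the dumps below)
    #$ -pe openmpi 4
    # Tight integration: no -hostfile/-machinefile, mpirun gets the hosts from SGE
    mpirun -np $NSLOTS ./my_app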

 
Here is the allocation reported by gridengine:

 
======================   ALLOCATED NODES   ======================

 Data for node: Name: barney       Launch id: -1   Arch: ffc91200   State: 2
        Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
        Daemon: [[22904,0],0]  Daemon launched: True
        Num slots: 1   Slots in use: 0
        Num slots allocated: 1   Max slots: 0
        Username on node: NULL
        Num procs: 0   Next node_rank: 0
 Data for node: Name: carl.fft     Launch id: -1   Arch: 0          State: 2
        Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
        Daemon: Not defined   Daemon launched: False
        Num slots: 1   Slots in use: 0
        Num slots allocated: 1   Max slots: 0
        Username on node: NULL
        Num procs: 0   Next node_rank: 0
 Data for node: Name: charlie.fft  Launch id: -1   Arch: 0          State: 2
        Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
        Daemon: Not defined   Daemon launched: False
        Num slots: 2   Slots in use: 0
        Num slots allocated: 2   Max slots: 0
        Username on node: NULL
        Num procs: 0   Next node_rank: 0

=================================================================

And here is the allocation (map) that is finally used:
 
 Map generated by mapping policy: 0200
        Npernode: 0   Oversubscribe allowed: TRUE   CPU Lists: FALSE
        Num new daemons: 2   New daemon starting vpid 1
        Num nodes: 3

 Data for node: Name: barney       Launch id: -1   Arch: ffc91200   State: 2
        Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
        Daemon: [[22904,0],0]  Daemon launched: True
        Num slots: 1   Slots in use: 2
        Num slots allocated: 1   Max slots: 0
        Username on node: NULL
        Num procs: 2   Next node_rank: 2
        Data for proc: [[22904,1],0]
                Pid: 0   Local rank: 0   Node rank: 0
                State: 0   App_context: 0   Slot list: NULL
        Data for proc: [[22904,1],3]
                Pid: 0   Local rank: 1   Node rank: 1
                State: 0   App_context: 0   Slot list: NULL

 Data for node: Name: carl.fft     Launch id: -1   Arch: 0          State: 2
        Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
        Daemon: [[22904,0],1]  Daemon launched: False
        Num slots: 1   Slots in use: 1
        Num slots allocated: 1   Max slots: 0
        Username on node: NULL
        Num procs: 1   Next node_rank: 1
        Data for proc: [[22904,1],1]
                Pid: 0   Local rank: 0   Node rank: 0
                State: 0   App_context: 0   Slot list: NULL

 Data for node: Name: charlie.fft  Launch id: -1   Arch: 0          State: 2
        Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
        Daemon: [[22904,0],2]  Daemon launched: False
        Num slots: 2   Slots in use: 1
        Num slots allocated: 2   Max slots: 0
        Username on node: NULL
        Num procs: 1   Next node_rank: 1
        Data for proc: [[22904,1],2]
                Pid: 0   Local rank: 0   Node rank: 0
                State: 0   App_context: 0   Slot list: NULL
 
Has anyone already encountered the same behavior?

Is there a simpler fix than not using the tight-integration mode between SGE and 
Open MPI?
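
Just to be explicit about the alternative I'd rather avoid: dropping tight integration 
would mean building the host list by hand from $PE_HOSTFILE inside the job script and 
passing it to mpirun explicitly, along these lines (untested sketch; the application 
name is a placeholder):

    # Convert the SGE PE hostfile (host  slots  queue  processors) into an
    # Open MPI machinefile with explicit slot counts.
    awk '{print $1" slots="$2}' $PE_HOSTFILE > $TMPDIR/machinefile
    mpirun -np $NSLOTS -machinefile $TMPDIR/machinefile ./my_app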

 
Eloi

 
