Hi, I've observed a strange behavior during rank allocation on a distributed run scheduled and submitted using SGE (Son of Grid Engine 8.0.0d) and OpenMPI 1.4.4.
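
For context, the job goes through the SGE/OpenMPI tight integration with a submission script roughly like the sketch below. This is only an illustration, not the exact script: the parallel environment name "orte", the job name, the slot count and the binary path are placeholders. The -display-allocation and -display-devel-map flags are the ones that print the kind of diagnostics quoted further down.

    #!/bin/bash
    #$ -N mpi_alloc_test
    #$ -pe orte 4            # request 4 slots from the OpenMPI-aware parallel environment (placeholder name)
    #$ -cwd

    # Dump what SGE itself granted, to compare with OpenMPI's view of the allocation
    cat $PE_HOSTFILE

    # Under tight integration mpirun picks up the SGE allocation automatically,
    # so no -np or -hostfile is passed here.
    mpirun -display-allocation -display-devel-map ./my_mpi_app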
Briefly, there is a one-slot difference between the allocation SGE grants and the one OpenMPI actually uses: barney is given one slot but receives two processes, so it becomes oversubscribed at runtime, while charlie.fft has two slots but receives only one process. Here is the output of the allocation done for gridengine:

======================   ALLOCATED NODES   ======================

 Data for node: Name: barney  Launch id: -1  Arch: ffc91200  State: 2
    Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
    Daemon: [[22904,0],0]  Daemon launched: True
    Num slots: 1  Slots in use: 0
    Num slots allocated: 1  Max slots: 0
    Username on node: NULL
    Num procs: 0  Next node_rank: 0

 Data for node: Name: carl.fft  Launch id: -1  Arch: 0  State: 2
    Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
    Daemon: Not defined  Daemon launched: False
    Num slots: 1  Slots in use: 0
    Num slots allocated: 1  Max slots: 0
    Username on node: NULL
    Num procs: 0  Next node_rank: 0

 Data for node: Name: charlie.fft  Launch id: -1  Arch: 0  State: 2
    Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
    Daemon: Not defined  Daemon launched: False
    Num slots: 2  Slots in use: 0
    Num slots allocated: 2  Max slots: 0
    Username on node: NULL
    Num procs: 0  Next node_rank: 0

And here is the allocation finally used:

=================================================================

 Map generated by mapping policy: 0200
    Npernode: 0  Oversubscribe allowed: TRUE  CPU Lists: FALSE
    Num new daemons: 2  New daemon starting vpid 1
    Num nodes: 3

 Data for node: Name: barney  Launch id: -1  Arch: ffc91200  State: 2
    Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
    Daemon: [[22904,0],0]  Daemon launched: True
    Num slots: 1  Slots in use: 2
    Num slots allocated: 1  Max slots: 0
    Username on node: NULL
    Num procs: 2  Next node_rank: 2
    Data for proc: [[22904,1],0]
       Pid: 0  Local rank: 0  Node rank: 0
       State: 0  App_context: 0  Slot list: NULL
    Data for proc: [[22904,1],3]
       Pid: 0  Local rank: 1  Node rank: 1
       State: 0  App_context: 0  Slot list: NULL

 Data for node: Name: carl.fft  Launch id: -1  Arch: 0  State: 2
    Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
    Daemon: [[22904,0],1]  Daemon launched: False
    Num slots: 1  Slots in use: 1
    Num slots allocated: 1  Max slots: 0
    Username on node: NULL
    Num procs: 1  Next node_rank: 1
    Data for proc: [[22904,1],1]
       Pid: 0  Local rank: 0  Node rank: 0
       State: 0  App_context: 0  Slot list: NULL

 Data for node: Name: charlie.fft  Launch id: -1  Arch: 0  State: 2
    Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
    Daemon: [[22904,0],2]  Daemon launched: False
    Num slots: 2  Slots in use: 1
    Num slots allocated: 2  Max slots: 0
    Username on node: NULL
    Num procs: 1  Next node_rank: 1
    Data for proc: [[22904,1],2]
       Pid: 0  Local rank: 0  Node rank: 0
       State: 0  App_context: 0  Slot list: NULL

Has anyone already encountered the same behavior? Is there a simpler fix than dropping the tight-integration mode between SGE and OpenMPI?

Eloi