How are you launching the application?

I had an app that did a Spawn_multiple with tight SGE integration, and
there was a difference in behavior depending on whether or not the app was
launched via mpiexec.  I'm not sure whether it's the same issue you're
seeing, but Reuti describes the problem here:
http://www.open-mpi.org/community/lists/users/2012/01/18348.php
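
To make the distinction concrete, here is a minimal sketch of the two launch
modes inside an SGE job script (the program name ./spawner is a placeholder):

    # Singleton launch: the app starts on its own and later calls
    # MPI_Comm_spawn_multiple; this is the case where I saw the mismatch.
    ./spawner

    # mpiexec launch: the same app started under mpiexec, which reads the
    # SGE allocation (via $PE_HOSTFILE) before any ranks run.
    mpiexec -np 1 ./spawner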

It will be resolved at some point, but I imagine that the fix will only go
into new releases: 
http://www.open-mpi.org/community/lists/users/2012/02/18399.php

In my case, the workaround was just to launch the app with mpiexec, and the
allocation was then handled correctly.
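
Something like this worked for me; a minimal sketch, assuming a
tight-integration parallel environment (the PE name "orte" and the app name
./myapp are placeholders; $NSLOTS is set by SGE to the number of granted
slots):

    #!/bin/sh
    #$ -pe orte 4   # request 4 slots from the tight-integration PE
    #$ -cwd
    # Open MPI's gridengine support picks up the granted allocation on its
    # own, so no hostfile is needed; one rank is started per granted slot.
    mpiexec -np $NSLOTS ./myapp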

---Tom

On 4/3/12 9:23 AM, "Eloi Gaudry" <eloi.gau...@fft.be> wrote:

> Hi,
> 
> I've observed a strange behavior during rank allocation on a distributed run
> scheduled and submitted using SGE (Son of Grid Engine 8.0.0d) and Open MPI
> 1.4.4.
> 
> Briefly, there is a one-slot difference between the ranks/slots allocated by
> SGE and by Open MPI. The issue here is that one node becomes oversubscribed
> at runtime.
> 
> Here is the output of the allocation done for gridengine:
> 
> 
> ======================   ALLOCATED NODES   ======================
> 
> Data for node: Name: barney  Launch id: -1  Arch: ffc91200  State: 2
>     Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>     Daemon: [[22904,0],0]  Daemon launched: True
>     Num slots: 1  Slots in use: 0
>     Num slots allocated: 1  Max slots: 0
>     Username on node: NULL
>     Num procs: 0  Next node_rank: 0
> 
> Data for node: Name: carl.fft  Launch id: -1  Arch: 0  State: 2
>     Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>     Daemon: Not defined  Daemon launched: False
>     Num slots: 1  Slots in use: 0
>     Num slots allocated: 1  Max slots: 0
>     Username on node: NULL
>     Num procs: 0  Next node_rank: 0
> 
> Data for node: Name: charlie.fft  Launch id: -1  Arch: 0  State: 2
>     Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>     Daemon: Not defined  Daemon launched: False
>     Num slots: 2  Slots in use: 0
>     Num slots allocated: 2  Max slots: 0
>     Username on node: NULL
>     Num procs: 0  Next node_rank: 0
> 
> And here is the allocation finally used:
> 
> =================================================================
> 
> 
> Map generated by mapping policy: 0200
>     Npernode: 0  Oversubscribe allowed: TRUE  CPU Lists: FALSE
>     Num new daemons: 2  New daemon starting vpid 1
>     Num nodes: 3
> 
> Data for node: Name: barney  Launch id: -1  Arch: ffc91200  State: 2
>     Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>     Daemon: [[22904,0],0]  Daemon launched: True
>     Num slots: 1  Slots in use: 2
>     Num slots allocated: 1  Max slots: 0
>     Username on node: NULL
>     Num procs: 2  Next node_rank: 2
>     Data for proc: [[22904,1],0]
>         Pid: 0  Local rank: 0  Node rank: 0
>         State: 0  App_context: 0  Slot list: NULL
>     Data for proc: [[22904,1],3]
>         Pid: 0  Local rank: 1  Node rank: 1
>         State: 0  App_context: 0  Slot list: NULL
> 
> Data for node: Name: carl.fft  Launch id: -1  Arch: 0  State: 2
>     Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>     Daemon: [[22904,0],1]  Daemon launched: False
>     Num slots: 1  Slots in use: 1
>     Num slots allocated: 1  Max slots: 0
>     Username on node: NULL
>     Num procs: 1  Next node_rank: 1
>     Data for proc: [[22904,1],1]
>         Pid: 0  Local rank: 0  Node rank: 0
>         State: 0  App_context: 0  Slot list: NULL
> 
> Data for node: Name: charlie.fft  Launch id: -1  Arch: 0  State: 2
>     Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>     Daemon: [[22904,0],2]  Daemon launched: False
>     Num slots: 2  Slots in use: 1
>     Num slots allocated: 2  Max slots: 0
>     Username on node: NULL
>     Num procs: 1  Next node_rank: 1
>     Data for proc: [[22904,1],2]
>         Pid: 0  Local rank: 0  Node rank: 0
>         State: 0  App_context: 0  Slot list: NULL
> 
> Has anyone already encountered the same behavior?
> 
> Is there a simpler fix than not using the tight integration mode between SGE
> and Open MPI?
> 
> Eloi