How are you launching the application? I had an app that did a Spawn_multiple with tight SGE integration, and there was a difference in behavior depending on whether or not the app was launched via mpiexec. I'm not sure whether it's the same issue you're seeing, but Reuti describes the problem here: http://www.open-mpi.org/community/lists/users/2012/01/18348.php
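For reference, the kind of call I mean is roughly the following. This is only a minimal sketch; the child binary names and process counts are made up for illustration, not taken from a real application:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm intercomm;
        /* Hypothetical child binaries -- placeholders only */
        char     *cmds[2]   = { "worker_a", "worker_b" };
        int       nprocs[2] = { 2, 2 };
        MPI_Info  infos[2]  = { MPI_INFO_NULL, MPI_INFO_NULL };

        MPI_Init(&argc, &argv);
        /* With tight SGE integration, where these children end up can
           depend on whether the parent was started as a singleton or
           via mpiexec. */
        MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, nprocs, infos,
                                0, MPI_COMM_SELF, &intercomm,
                                MPI_ERRCODES_IGNORE);
        MPI_Finalize();
        return 0;
    }

Started as a plain singleton (./parent), the spawn placed processes differently than when the same parent was started with mpiexec under the same SGE allocation.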
That problem will be resolved at some point, but I imagine that the fix will only go into new releases: http://www.open-mpi.org/community/lists/users/2012/02/18399.php

In my case, the workaround was just to launch the app with mpiexec, and the allocation is handled correctly.

---Tom

On 4/3/12 9:23 AM, "Eloi Gaudry" <eloi.gau...@fft.be> wrote:

> Hi,
>
> I've observed a strange behavior during rank allocation for a distributed run scheduled and submitted using SGE (Son of Grid Engine 8.0.0d) and Open MPI 1.4.4.
>
> Briefly, there is a one-slot difference between the slots allocated by SGE and the ranks placed by Open MPI. The issue here is that one node becomes oversubscribed at runtime.
>
> Here is the output of the allocation done for gridengine:
>
> ====================== ALLOCATED NODES ======================
>
> Data for node: Name: barney  Launch id: -1  Arch: ffc91200  State: 2
>   Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>   Daemon: [[22904,0],0]  Daemon launched: True
>   Num slots: 1  Slots in use: 0
>   Num slots allocated: 1  Max slots: 0
>   Username on node: NULL
>   Num procs: 0  Next node_rank: 0
>
> Data for node: Name: carl.fft  Launch id: -1  Arch: 0  State: 2
>   Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>   Daemon: Not defined  Daemon launched: False
>   Num slots: 1  Slots in use: 0
>   Num slots allocated: 1  Max slots: 0
>   Username on node: NULL
>   Num procs: 0  Next node_rank: 0
>
> Data for node: Name: charlie.fft  Launch id: -1  Arch: 0  State: 2
>   Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>   Daemon: Not defined  Daemon launched: False
>   Num slots: 2  Slots in use: 0
>   Num slots allocated: 2  Max slots: 0
>   Username on node: NULL
>   Num procs: 0  Next node_rank: 0
>
> And here is the allocation finally used:
>
> =================================================================
>
> Map generated by mapping policy: 0200
>   Npernode: 0  Oversubscribe allowed: TRUE  CPU Lists: FALSE
>   Num new daemons: 2  New daemon starting vpid 1
>   Num nodes: 3
>
> Data for node: Name: barney  Launch id: -1  Arch: ffc91200  State: 2
>   Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>   Daemon: [[22904,0],0]  Daemon launched: True
>   Num slots: 1  Slots in use: 2
>   Num slots allocated: 1  Max slots: 0
>   Username on node: NULL
>   Num procs: 2  Next node_rank: 2
>   Data for proc: [[22904,1],0]
>     Pid: 0  Local rank: 0  Node rank: 0
>     State: 0  App_context: 0  Slot list: NULL
>   Data for proc: [[22904,1],3]
>     Pid: 0  Local rank: 1  Node rank: 1
>     State: 0  App_context: 0  Slot list: NULL
>
> Data for node: Name: carl.fft  Launch id: -1  Arch: 0  State: 2
>   Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>   Daemon: [[22904,0],1]  Daemon launched: False
>   Num slots: 1  Slots in use: 1
>   Num slots allocated: 1  Max slots: 0
>   Username on node: NULL
>   Num procs: 1  Next node_rank: 1
>   Data for proc: [[22904,1],1]
>     Pid: 0  Local rank: 0  Node rank: 0
>     State: 0  App_context: 0  Slot list: NULL
>
> Data for node: Name: charlie.fft  Launch id: -1  Arch: 0  State: 2
>   Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>   Daemon: [[22904,0],2]  Daemon launched: False
>   Num slots: 2  Slots in use: 1
>   Num slots allocated: 2  Max slots: 0
>   Username on node: NULL
>   Num procs: 1  Next node_rank: 1
>   Data for proc: [[22904,1],2]
>     Pid: 0  Local rank: 0  Node rank: 0
>     State: 0  App_context: 0  Slot list: NULL
>
> Has anyone already encountered the same behavior?
> Is there a simpler fix than not using the tight integration mode between SGE and Open MPI?
>
> Eloi
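P.S. For completeness, the workaround on my side boils down to a submission script along these lines. This is only a rough sketch: the parallel environment name ("orte"), the slot count, and the binary name are placeholders from my setup, not from your cluster.

    #!/bin/sh
    #$ -cwd
    #$ -pe orte 4
    # With tight integration, mpiexec picks the host/slot list up from SGE,
    # so no hostfile or explicit -np is given here.
    mpiexec ./my_app

That script is submitted with qsub, instead of starting the binary (or the spawning parent) directly on the first node of the allocation.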