On 03.04.2012, at 17:24, Eloi Gaudry wrote:

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Reuti
> Sent: Tuesday, April 3, 2012 17:13
> To: Open MPI Users
> Subject: Re: [OMPI users] sge tight integration leads to bad allocation
>
> On 03.04.2012, at 16:59, Eloi Gaudry wrote:
>
>> Hi Reuti,
>>
>> I configured Open MPI to support SGE tight integration and used the PE defined below for submitting the job:
>>
>> [16:36][eg@moe:~]$ qconf -sp fill_up
>> pe_name            fill_up
>> slots              80
>> user_lists         NONE
>> xuser_lists        NONE
>> start_proc_args    /bin/true
>> stop_proc_args     /bin/true
>> allocation_rule    $fill_up
>
> It should fill a host completely before moving to the next one with this definition.
>
> [eg: ] Yes, and it should also make sure that all hard requirements are met. Note that the allocation done by SGE is correct here; it is what Open MPI finally does at startup that is different (and incorrect).
>
>
>> control_slaves     TRUE
>> job_is_first_task  FALSE
>> urgency_slots      min
>> accounting_summary FALSE
>>
>> Here is the allocation info retrieved from `qstat -g t` for the related job:
>
> For me the output of `qstat -g t` shows MASTER and SLAVE entries but no variables. Is there any wrapper defined for `qstat` to reformat the output (or a ~/.sge_qstat defined)?
>
> [eg: ] Sorry, I forgot about sge_qstat being defined. As I don't have any slot available right now, I cannot relaunch the job to get the output updated.
>
> And why is "num_proc=0" output everywhere - was it redefined? Usually it's a load sensor set to the number of cores found in the machine and shouldn't be touched by hand by making it a consumable complex.
>
> [eg: ] My mistake, I think: this was made a consumable complex so that we could easily schedule multithreaded and parallel jobs on the cluster. I guess I should define another complex (proc_available), make it consumable and consume from this complex instead of touching the num_proc load sensor one then...

No. Also a threaded job is a parallel one with allocation_rule $pe_slots, no custom complex necessary. Often such a PE is called "smp".
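For illustration only (the values are just an example - adjust the slot limit and attach the PE to your queues as needed), such a PE could be added with `qconf -ap smp` and look like this:

   pe_name            smp
   slots              999
   user_lists         NONE
   xuser_lists        NONE
   start_proc_args    /bin/true
   stop_proc_args     /bin/true
   allocation_rule    $pe_slots
   control_slaves     TRUE
   job_is_first_task  FALSE
   urgency_slots      min
   accounting_summary FALSE

A multithreaded job would then request it with e.g. `qsub -pe smp 4 ...`, and $pe_slots guarantees that all 4 slots are granted on a single host, so num_proc can stay a plain load sensor.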
So, for now we can't solve the initial issue.

-- Reuti

>
> -- Reuti
>
>
>> ---------------------------------------------------------------------------------
>> smp...@barney.fft              BIP   0/1/4          0.70     lx-amd64
>>      hc:num_proc=0
>>      hl:mem_free=31.215G
>>      hl:mem_used=280.996M
>>      hc:mem_available=1.715G
>>    1296 0.54786 semi_direc jj          r     04/03/2012 16:43:49     1
>> ---------------------------------------------------------------------------------
>> smp...@carl.fft                BIP   0/1/4          0.69     lx-amd64
>>      hc:num_proc=0
>>      hl:mem_free=30.764G
>>      hl:mem_used=742.805M
>>      hc:mem_available=1.715G
>>    1296 0.54786 semi_direc jj          r     04/03/2012 16:43:49     1
>> ---------------------------------------------------------------------------------
>> smp...@charlie.fft             BIP   0/2/8          0.57     lx-amd64
>>      hc:num_proc=0
>>      hl:mem_free=62.234G
>>      hl:mem_used=836.797M
>>      hc:mem_available=4.018G
>>    1296 0.54786 semi_direc jj          r     04/03/2012 16:43:49     2
>> ---------------------------------------------------------------------------------
>>
>> SGE reports what pls_gridengine_report does, i.e. what was reserved.
>> But here is the output of the current job (after being started by Open MPI):
>> [charlie:05294] ras:gridengine: JOB_ID: 1296
>> [charlie:05294] ras:gridengine: PE_HOSTFILE: /opt/sge/default/spool/charlie/active_jobs/1296.1/pe_hostfile
>> [charlie:05294] ras:gridengine: charlie.fft: PE_HOSTFILE shows slots=2
>> [charlie:05294] ras:gridengine: barney.fft: PE_HOSTFILE shows slots=1
>> [charlie:05294] ras:gridengine: carl.fft: PE_HOSTFILE shows slots=1
>>
>> ======================   ALLOCATED NODES   ======================
>>
>> Data for node: Name: charlie      Launch id: -1  Arch: ffc91200  State: 2
>>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
>>      Daemon: [[54347,0],0]  Daemon launched: True
>>      Num slots: 2  Slots in use: 0
>>      Num slots allocated: 2  Max slots: 0
>>      Username on node: NULL
>>      Num procs: 0  Next node_rank: 0
>> Data for node: Name: barney.fft   Launch id: -1  Arch: 0  State: 2
>>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
>>      Daemon: Not defined  Daemon launched: False
>>      Num slots: 1  Slots in use: 0
>>      Num slots allocated: 1  Max slots: 0
>>      Username on node: NULL
>>      Num procs: 0  Next node_rank: 0
>> Data for node: Name: carl.fft     Launch id: -1  Arch: 0  State: 2
>>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
>>      Daemon: Not defined  Daemon launched: False
>>      Num slots: 1  Slots in use: 0
>>      Num slots allocated: 1  Max slots: 0
>>      Username on node: NULL
>>      Num procs: 0  Next node_rank: 0
>>
>> =================================================================
>>
>> Map generated by mapping policy: 0200
>>      Npernode: 0  Oversubscribe allowed: TRUE  CPU Lists: FALSE
>>      Num new daemons: 2  New daemon starting vpid 1
>>      Num nodes: 3
>>
>> Data for node: Name: charlie      Launch id: -1  Arch: ffc91200  State: 2
>>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
>>      Daemon: [[54347,0],0]  Daemon launched: True
>>      Num slots: 2  Slots in use: 2
>>      Num slots allocated: 2  Max slots: 0
>>      Username on node: NULL
>>      Num procs: 2  Next node_rank: 2
>>      Data for proc: [[54347,1],0]
>>              Pid: 0  Local rank: 0  Node rank: 0
>>              State: 0  App_context: 0  Slot list: NULL
>>      Data for proc: [[54347,1],3]
>>              Pid: 0  Local rank: 1  Node rank: 1
>>              State: 0  App_context: 0  Slot list: NULL
>> Data for node: Name: barney.fft   Launch id: -1  Arch: 0  State: 2
>>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
>>      Daemon: [[54347,0],1]  Daemon launched: False
>>      Num slots: 1  Slots in use: 1
>>      Num slots allocated: 1  Max slots: 0
>>      Username on node: NULL
>>      Num procs: 1  Next node_rank: 1
>>      Data for proc: [[54347,1],1]
>>              Pid: 0  Local rank: 0  Node rank: 0
>>              State: 0  App_context: 0  Slot list: NULL
>>
>> Data for node: Name: carl.fft     Launch id: -1  Arch: 0  State: 2
>>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
>>      Daemon: [[54347,0],2]  Daemon launched: False
>>      Num slots: 1  Slots in use: 1
>>      Num slots allocated: 1  Max slots: 0
>>      Username on node: NULL
>>      Num procs: 1  Next node_rank: 1
>>      Data for proc: [[54347,1],2]
>>              Pid: 0  Local rank: 0  Node rank: 0
>>              State: 0  App_context: 0  Slot list: NULL
>>
>> Regards,
>> Eloi
>>
>>
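(A side note on the quoted output above: Open MPI's ras:gridengine module takes those slot counts directly from $PE_HOSTFILE, so for this job the file should have contained roughly the lines below. This is only reconstructed from the output above, not the real file, and the queue column is a guess:

   charlie.fft 2 smp...@charlie.fft UNDEFINED
   barney.fft 1 smp...@barney.fft UNDEFINED
   carl.fft 1 smp...@carl.fft UNDEFINED

i.e. one line per host: hostname, granted slots, queue instance, processor range.)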
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Reuti
>> Sent: Tuesday, April 3, 2012 16:24
>> To: Open MPI Users
>> Subject: Re: [OMPI users] sge tight integration leads to bad allocation
>>
>> Hi,
>>
>> On 03.04.2012, at 16:12, Eloi Gaudry wrote:
>>
>>> Thanks for your feedback.
>>> No, it is the other way around: the "reserved" slots on all nodes are OK, but the "used" slots are different.
>>>
>>> Basically, I'm using SGE to schedule and book resources for a distributed job. When the job is finally launched, it uses a different allocation than the one that was reported by pls_gridengine_info.
>>>
>>> The pls_gridengine_info report states that 3 nodes were booked: barney (1 slot), carl (1 slot) and charlie (2 slots). This booking was done by SGE depending on the memory requirements of the job (among others).
>>>
>>> When orterun starts the job (i.e. when SGE finally starts the scheduled job), it uses 3 nodes, but the first one (barney: 2 slots used instead of 1) is oversubscribed and the last one (charlie: 1 slot used instead of 2) is underused.
>>
>> You configured Open MPI to support SGE tight integration and used a PE for submitting the job? Can you please post the definition of the PE.
>>
>> What was the allocation you saw in SGE's `qstat -g t` for the job?
>>
>> -- Reuti
>>
>>
>>> If you need further information, please let me know.
>>>
>>> Eloi
>>>
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
>>> Sent: Tuesday, April 3, 2012 15:58
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] sge tight integration leads to bad allocation
>>>
>>> I'm afraid there isn't enough info here to help. Are you saying you only allocated one slot/node, so the two slots on charlie are in error?
>>>
>>> Sent from my iPad
>>>
>>> On Apr 3, 2012, at 6:23 AM, "Eloi Gaudry" <eloi.gau...@fft.be> wrote:
>>>
>>> Hi,
>>>
>>> I've observed a strange behavior during rank allocation for a distributed run scheduled and submitted using SGE (Son of Grid Engine 8.0.0d) and Open MPI 1.4.4.
>>> Briefly, there is a one-slot difference between the slots allocated by SGE and those used by Open MPI. The issue here is that one node becomes oversubscribed at runtime.
>>>
>>> Here is the output of the allocation done for gridengine:
>>>
>>> ======================   ALLOCATED NODES   ======================
>>>
>>> Data for node: Name: barney       Launch id: -1  Arch: ffc91200  State: 2
>>>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>>>      Daemon: [[22904,0],0]  Daemon launched: True
>>>      Num slots: 1  Slots in use: 0
>>>      Num slots allocated: 1  Max slots: 0
>>>      Username on node: NULL
>>>      Num procs: 0  Next node_rank: 0
>>> Data for node: Name: carl.fft     Launch id: -1  Arch: 0  State: 2
>>>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>>>      Daemon: Not defined  Daemon launched: False
>>>      Num slots: 1  Slots in use: 0
>>>      Num slots allocated: 1  Max slots: 0
>>>      Username on node: NULL
>>>      Num procs: 0  Next node_rank: 0
>>> Data for node: Name: charlie.fft  Launch id: -1  Arch: 0  State: 2
>>>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>>>      Daemon: Not defined  Daemon launched: False
>>>      Num slots: 2  Slots in use: 0
>>>      Num slots allocated: 2  Max slots: 0
>>>      Username on node: NULL
>>>      Num procs: 0  Next node_rank: 0
>>>
>>>
>>> And here is the allocation finally used:
>>> =================================================================
>>>
>>> Map generated by mapping policy: 0200
>>>      Npernode: 0  Oversubscribe allowed: TRUE  CPU Lists: FALSE
>>>      Num new daemons: 2  New daemon starting vpid 1
>>>      Num nodes: 3
>>>
>>> Data for node: Name: barney       Launch id: -1  Arch: ffc91200  State: 2
>>>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>>>      Daemon: [[22904,0],0]  Daemon launched: True
>>>      Num slots: 1  Slots in use: 2
>>>      Num slots allocated: 1  Max slots: 0
>>>      Username on node: NULL
>>>      Num procs: 2  Next node_rank: 2
>>>      Data for proc: [[22904,1],0]
>>>              Pid: 0  Local rank: 0  Node rank: 0
>>>              State: 0  App_context: 0
>>>              Slot list: NULL
>>>      Data for proc: [[22904,1],3]
>>>              Pid: 0  Local rank: 1  Node rank: 1
>>>              State: 0  App_context: 0
>>>              Slot list: NULL
>>>
>>> Data for node: Name: carl.fft     Launch id: -1  Arch: 0  State: 2
>>>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>>>      Daemon: [[22904,0],1]  Daemon launched: False
>>>      Num slots: 1  Slots in use: 1
>>>      Num slots allocated: 1  Max slots: 0
>>>      Username on node: NULL
>>>      Num procs: 1  Next node_rank: 1
>>>      Data for proc: [[22904,1],1]
>>>              Pid: 0  Local rank: 0  Node rank: 0
>>>              State: 0  App_context: 0
>>>              Slot list: NULL
>>>
>>> Data for node: Name: charlie.fft  Launch id: -1  Arch: 0  State: 2
>>>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>>>      Daemon: [[22904,0],2]  Daemon launched: False
>>>      Num slots: 2  Slots in use: 1
>>>      Num slots allocated: 2  Max slots: 0
>>>      Username on node: NULL
>>>      Num procs: 1  Next node_rank: 1
>>>      Data for proc: [[22904,1],2]
>>>              Pid: 0  Local rank: 0  Node rank: 0
>>>              State: 0  App_context: 0
>>>              Slot list: NULL
>>>
>>> Has anyone already encountered the same behavior?
>>> Is there a simpler fix than not using the tight integration mode between SGE and Open MPI?
>>>
>>> Eloi
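P.S. For the next run, once a slot is free again, it may help to dump what SGE granted right inside the job script, before mpirun is called, so it can be compared with Open MPI's map for the very same job. A minimal sketch (the slot count and application name are placeholders):

   #!/bin/sh
   #$ -pe fill_up 4
   #$ -cwd
   cat $PE_HOSTFILE           # the allocation SGE granted to this job
   mpirun ./your_application  # with tight integration, Open MPI reads the same file

That way the pe_hostfile contents and the "ALLOCATED NODES" / map output can be compared directly.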