Hi,

> On 09.03.2015 at 14:02, Rafael Arco Arredondo <rafaa...@ugr.es> wrote:
> 
> Hello again,
> 
> Now we have a different situation. I'll try to explain it as best as I
> can. We have two different sets of nodes, A and B, A with 16 cores per
> host and B with 32. We want to put all the nodes in only one cluster
> queue, which also has two parallel environments configured, pe16 and
> pe32, with an allocation_rule of 16 and 32 respectively. The queue has
> the following definition for the number of slots:

Before going into detail: did you attach both PEs to all machines, or is the 
pe_list also split by host group, like the slots:

pe_list pe16,[@B=pe32]

?
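
For context: with the pe_list overridden per host group like that, a pe16 
job can only be dispatched to queue instances on @A and a pe32 job only 
to @B. A sketch of the complete split in the queue definition (assuming 
your host groups are literally named @A and @B):

hostlist   @A @B
slots      16,[@B=32]
pe_list    pe16,[@B=pe32]

Such a mapping would also give you the "PE selects the node set" behavior 
you ask about below.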

-- Reuti


> slots                 16,[@B=32]
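> 
> For reference, the relevant lines of the two PE definitions (a sketch; 
> only the fields that matter here):
> 
> pe_name          pe16
> allocation_rule  16
> 
> pe_name          pe32
> allocation_rule  32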
> 
> After submitting several jobs with 32 slots to the queue and pe16 as
> the parallel environment, we observed that the first job started
> running on two of the nodes of the B set -not only one- when it could
> have run on just one node. I understand this is because GE allocates
> the first 16 slots as a block, and then uses the least loaded machine
> for the next 16 (which, of course, is not the first machine). The
> second job then used 2 nodes of the A set, which is the behavior that
> we want.
> 
> Of course, the easiest solution is to create two queues, but we don't
> want to do that. Is there any way to avoid what happened with the
> first job? Ideally, we would like to be able to choose the set of
> nodes depending on the parallel environment (that is, if the user
> requests pe16 the job runs on set A, and if the user requests pe32 it
> runs on set B). I imagine the orthodox way of doing this is with two
> different cluster queues, but I want to make sure it's not possible
> with only one queue.
> 
> We would also like more information about how GE selects the nodes in
> this type of situation, i.e. when the allocation rule is a fixed number
> and not fill_up or round_robin. Does it select the least loaded node,
> or the most loaded one? Is there any detailed documentation about this?
> And besides, why does the fill_up rule sometimes leave fragmented nodes
> (that is, when -pe pe16 16 is used and 2 nodes are selected, each with
> 8 slots), even when there are no jobs using fewer than 16 slots?
> Perhaps because a user submits a job with, let's say, 8 slots, it fails
> in the prolog (where we check that the number of slots is a multiple of
> 16) and ends up in the error state (Eqw), and then the selected slots
> are kept reserved as if they were effectively in use?
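> 
> To check this we look at the output of:
> 
> qstat -g t          # one line per parallel task, with its queue instance
> qstat -j <job_id>   # scheduling and error information for a given job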
> 
> I hope everything is clear enough. Thanks again for your help and best
> regards,
> 
> Rafa
> 
> On Mon, 22-09-2014 at 12:13 +0200, Rafael Arco Arredondo wrote:
>> Thanks Reuti, we'll try the fixed allocation rules. We will have to
>> create one or two more parallel environments for the different machine
>> types, but I hope it won't disturb our users much.
>> 
>> Rafa
>> 
>> On Thu, 18-09-2014 at 15:46 +0200, Reuti wrote:
>>> On 18.09.2014 at 14:52, Rafael Arco Arredondo wrote:
>>> 
>>>> The hosts are free before running the jobs and are all identical in
>>>> terms of available resources.
>>>> 
>>>> Looking a bit deeper into the problem, it seems that sometimes jobs
>>>> requesting only 8 slots are executed on the nodes, overriding the check
>>>> in the prolog (which says the number of slots has to be a multiple of 16).
>>> 
>>> In the prolog? At that point the exechost selection has already been done. 
>>> Do you reschedule the job then?
>>> 
>>> You can define a fixed allocation rule for each type of machine, e.g. 
>>> "allocation_rule 8". Then a job requesting a multiple of 8 can select the 
>>> PE by a wildcard.
>>> 
>>> qsub -pe baz 32
>>> 
>>> will get 4 times 8 cores for sure
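>>> 
>>> A sketch of such a PE (assuming the name baz; only the relevant lines):
>>> 
>>> pe_name          baz
>>> slots            999
>>> allocation_rule  8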
>>> 
>>> ===
>>> 
>>> To make it a little bit flexible:
>>> 
>>> PE: foo_2
>>> allocation_rule 2
>>> 
>>> PE: foo_24
>>> allocation_rule 4
>>> 
>>> PE: foo_248
>>> allocation_rule 8
>>> 
>>> PE: foo_24816
>>> allocation_rule 16
>>> 
>>> Submission command:
>>> 
>>> qsub -pe foo* 16
>>> 
>>> => gets 16 slots from any one of the PEs (but only from one PE once it's 
>>> selected); it could be, e.g., 8 times 2 slots or 4 times 4 slots
>>> 
>>> If you don't want too many nodes:
>>> 
>>> qsub -pe foo_248* 16
>>> 
>>> => means: at least 8 slots per machine, i.e. 2 times 8 cores or one time 
>>> 16 cores
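>>> 
>>> (Depending on your shell you may have to quote the wildcard so that it 
>>> is not expanded as a file glob:
>>> 
>>> qsub -pe 'foo*' 16
>>> 
>>> The PEs themselves are created as usual with qconf -ap <pe_name>.)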
>>> 
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> It's as if the prolog sometimes wasn't executed...
>>>> 
>>>> On Thu, 18-09-2014 at 14:01 +0200, Winkler, Ursula
>>>> (ursula.wink...@uni-graz.at) wrote:
>>>>> Hi Rafa,
>>>>> 
>>>>> Are the jobs scheduled on hosts where other jobs are already running (so 
>>>>> that only 8 slots are used on some hosts)? Or are all hosts free? Do all 
>>>>> nodes have the same resources (i.e. slots, memory, ...) configured?
>>>>> 
>>>>> C., U.
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Rafael Arco Arredondo [mailto:rafaa...@ugr.es] 
>>>>> Sent: Thursday, 18 September 2014 12:23
>>>>> To: Winkler, Ursula (ursula.wink...@uni-graz.at)
>>>>> Cc: users@gridengine.org
>>>>> Subject: Re: AW: [gridengine users] Hosts not fully used with fill up
>>>>> 
>>>>> Thanks Ursula for your reply, but we need a parallel environment for more 
>>>>> than one node (we are using it with MPI). Sorry if I wasn't clear enough.
>>>>> 
>>>>> Here is an example. We have hosts with 16 slots. A user requests 
>>>>> 32 slots for a job, but instead of allocating slots on two nodes, 
>>>>> sometimes four nodes are used, each with 8 slots. We don't know why this 
>>>>> is happening, but it didn't happen with 6.2 with a similar configuration.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Rafa
>>>>> 
>>>>> On Thu, 18-09-2014 at 12:11 +0200, Winkler, Ursula
>>>>> (ursula.wink...@uni-graz.at) wrote:
>>>>>> Hi Rafa,
>>>>>> 
>>>>>> For such purposes we have configured a separate parallel environment 
>>>>>> with "allocation_rule" "$pe_slots" (instead of "$fill_up"). Jobs 
>>>>>> scheduled with this rule can run ONLY on one host.
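>>>>>> 
>>>>>> A sketch of such a PE (with a hypothetical name, smp; only the relevant 
>>>>>> lines):
>>>>>> 
>>>>>> pe_name          smp
>>>>>> slots            999
>>>>>> allocation_rule  $pe_slots
>>>>>> 
>>>>>> qsub -pe smp 16    # all 16 slots end up on a single host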
>>>>>> 
>>>>>> Regards,
>>>>>> Ursula
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: users-boun...@gridengine.org 
>>>>>> [mailto:users-boun...@gridengine.org] On behalf of Rafael Arco 
>>>>>> Arredondo
>>>>>> Sent: Thursday, 18 September 2014 09:44
>>>>>> To: users@gridengine.org
>>>>>> Subject: [gridengine users] Hosts not fully used with fill up
>>>>>> 
>>>>>> Hello everyone,
>>>>>> 
>>>>>> We are having an issue with the parallel environments and the allocation 
>>>>>> of slots under the fill_up policy.
>>>>>> 
>>>>>> Although we have configured the resource quotas of the queues so that no 
>>>>>> more slots are used than a machine has, and we check in the prolog that 
>>>>>> jobs are submitted with a number of slots that is a multiple of the 
>>>>>> number of physical processors, we are observing that sometimes the 
>>>>>> slots of a job are split across several nodes when they should all be 
>>>>>> on one node.
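>>>>>> 
>>>>>> The quota is essentially this (a sketch; the rule name is made up):
>>>>>> 
>>>>>> {
>>>>>>    name         max_slots_per_host
>>>>>>    description  never use more slots than a host has processors
>>>>>>    enabled      TRUE
>>>>>>    limit        hosts {*} to slots=$num_proc
>>>>>> }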
>>>>>> 
>>>>>> We are using Open Grid Scheduler 2011.11p1. This didn't happen in SGE 
>>>>>> 6.2.
>>>>>> 
>>>>>> Has anyone experienced the same situation? Any clue as to why it is 
>>>>>> happening?
>>>>>> 
>>>>>> Thanks in advance,
>>>>>> 
>>>>>> Rafa
>>>>>> 
>>>> 
>>> 
>> 
> 
> 
> -- 
> Rafael Arco Arredondo
> Centro de Servicios de Informática y Redes de Comunicaciones
> Campus de Fuentenueva - Edificio Mecenas
> Universidad de Granada
> E-18071 Granada Spain
> Tel: +34 958 241440   Ext:41440   E-mail: rafaa...@ugr.es
> 


_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
