Hi Reuti,
Thank you for your responses. I appreciate your time. See in line:

On 16/05/12 7:14 AM, "Reuti" <[email protected]> wrote:

>Hi,
>
>Am 15.05.2012 um 22:59 schrieb Jake Carroll:
>
>> A couple of quick questions this morning with some ROCKS/SGE scheduler
>> semantics.
>>
>> 1. I've got some new users who want to drive the cluster we have set up
>> with the very maximum efficiency possible, i.e. a user can use as much
>> of the cluster as possible when they submit a job. With over 1000
>> cores but many users, one of the things we did do was limit a user's
>> ability to take up more than about 300 or 400 slots, such that they
>> could only ever utilise maybe 20 to 30% of the cluster at any given
>> time. My new users don't like this and they want to be able to use 100%
>> of the system, if it's free and no other jobs are running.
>
>How long do their jobs run? Once a job is entitled to run by SGE, it will
>stay in this state. So they could block the cluster for some time once
>their jobs have started.

They can indeed run for weeks.

>> Now, my understanding is that we could definitely remove that limit of
>> 300 or 400 slots/jobs, but it'll have a couple of detrimental impacts.
>> Primarily, it'll preclude any other user from starting jobs at any
>> given time if their jobs are running, as there are no free slots.
>
>Yes. Maybe this would work for you: a single user can only occupy 80% of
>the cluster, even if there are free cores.

Yup. We are considering just giving them a larger allocation, rather than
simply letting them use the entire cluster.

>> 2. My users told me "no, no, you can simply put our jobs 'to sleep'
>> when others in the queue log in to run their jobs."
>>
>> Now, my understanding of that is, yes, that is possible (though I
>> don't know how it's implemented; a fairshare policy queue / weight
>> perhaps?)
>> BUT it has the big drawback that when a user's job is "asleep",
>> it will actually still keep hold of the memory allocation on the node;
>> thus, if another big-memory job comes along and the node's memory is
>> over-subscribed, crashing scenarios will ensue! Can somebody confirm
>> that kind of functionality/concern for me?
>
>Correct. Used resources remain used. But you could allow some swap space,
>and swapping out the sleeping job would only occur once.

OK. That is a good idea. Do you know how I can do this? Is there a
specific part of the SGE man pages that I'll need to consult to learn how
to enable "sleep" states when other users try to run jobs? I've never
implemented that before.

>> 3. My users want jobs to "persist" over the course of a cluster head
>> node crash. Would I be right in saying that it's only possible to
>> persist across crashes if the users are using CHECKPOINTING in their
>> jobs? I've heard of it before, just never implemented it, and I don't
>> know where to start.
>
>You can reboot the head node of an SGE cluster which runs the qmaster
>any time you like. The jobs aren't affected at all. For this
>functionality no checkpointing is necessary. Checkpointing (if already
>supported by the application outside of SGE) would allow surviving a
>crash of an exec host.

OK. That's important to know. So it's a function of the application
outside of SGE, and not so much SGE itself.

Thank you for your time.

JC

>
>-- Reuti
>
>
>> Thank you for your time, all.
>>
>> JC
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
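PS: For anyone searching the archives later, here is my reading of the
per-user cap Reuti describes. A resource quota set (sge_resource_quota(5))
should do it; this is untested on our setup, and the rule name and slot
count below are just placeholders:

```
# qconf -arqs max_user_slots    (opens an editor; sketch of the contents)
{
   name         max_user_slots
   description  "No single user may occupy more than ~80% of the slots"
   enabled      TRUE
   limit        users {*} to slots=800
}
```

The {*} makes the rule apply to each user individually rather than to all
users combined, and qquota should then show per-user consumption against
the limit.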
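For the "put jobs to sleep" part, my understanding is that the usual SGE
mechanism is a subordinate queue (queue_conf(5)): jobs in the subordinate
queue are suspended with SIGSTOP while the superior queue is occupied on a
host. A sketch, untested here, with placeholder queue names:

```
# qconf -mq high.q    (the superior queue; only the relevant line shown)
subordinate_list  low.q=1
# Once 1 slot of high.q is in use on a host, low.q on that host is
# suspended; it resumes when high.q drains again on that host.
```

Note this matches the memory concern above: a suspended job keeps its
memory allocation unless the kernel swaps it out, which is presumably why
allowing some swap space helps.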
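And for checkpointing, the starting point seems to be checkpoint(5): you
register a checkpointing environment and users submit against it with
qsub -ckpt. A rough sketch, assuming the application brings its own
checkpoint/restart support; every name and path below is a made-up
placeholder:

```
# qconf -ackpt myckpt    (opens an editor; sketch of the contents)
ckpt_name          myckpt
interface          application-level
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /tmp
signal             none
when               xs

# Users would then submit with:
#   qsub -ckpt myckpt job.sh
```

As Reuti says, the actual ability to checkpoint and restart has to come
from the application itself; SGE only controls when it is triggered.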
