Hi,

On 15.05.2012 at 22:59, Jake Carroll wrote:

> A couple of quick questions this morning with some ROCKS/SGE scheduler 
> semantics.
> 
> 1. I've got some new users who want to drive the cluster we have set up
> with the maximum efficiency possible, i.e. a user can use as much of
> the cluster as possible when they submit a job. With over 1000 cores
> but many users, one of the things we did was limit a user's ability to take
> up more than about 300 or 400 slots, such that they could only ever utilise
> maybe 20 to 30% of the cluster at any given time. My new users don't like
> this, and they want to be able to use 100% of the system if it's free and no
> other jobs are running.

How long do their jobs run? Once a job is entitled to run by SGE, it will stay
in this state. So they could block the cluster for some time once their jobs
have started.


> Now, my understanding is that we could definitely remove that limit of 300 or 
> 400 slots/jobs, but it'll have a couple of detrimental impacts:
> Primarily, it'll preclude any other user from starting jobs at any given time
> if their jobs are running, as there are no free slots.

Yes. Maybe this would work for you: allow a single user to use only 80% of the
cluster, even if there are free cores.
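Such a per-user cap can be expressed as a resource quota set. A minimal sketch,
assuming roughly 1000 slots total (so 80% ≈ 800) — the quota name is made up,
and you'd create it with `qconf -arqs`:

```
{
   name         max_user_slots
   description  "Cap any single user at ~80% of the cluster"
   enabled      TRUE
   limit        users {*} to slots=800
}
```

The `{*}` means the limit is applied to each user individually, not to all
users combined.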


> 2. My users told me "no, no – you can simply put our jobs 'to sleep' when
> others in the queue log in to run their jobs."
> 
> Now, my understanding of that is, yes, that is possible (though, I don't know
> how it's implemented – fairshare policy queue / weight perhaps?) BUT it has
> the big drawback that when a user's job is "asleep", it will actually still
> keep hold of the memory allocation on the node. Thus, if another big-memory job
> comes along and the node is memory over-subscribed, crashing scenarios will
> ensue! Can somebody confirm that kind of functionality/concern for me?

Correct. Used resources remain used. But you could allow some swap space, and
swapping out the suspended job would then only occur once.
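The usual mechanism for this is a subordinate queue: jobs in a low-priority
queue are suspended (SIGSTOP) as soon as a higher-priority queue on the same
host gets work. A sketch, with hypothetical queue names `high.q` and `low.q`,
set in the configuration of the superior queue via `qconf -mq high.q`:

```
# In high.q's queue configuration:
# suspend low.q on a host as soon as 1 slot of high.q is occupied there.
subordinate_list    low.q=1
```

Without the `=1` threshold, `low.q` would only be suspended once `high.q` is
completely full on that host. The memory caveat above still applies: suspended
processes keep their allocations until the kernel pages them out to swap.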


> 3. My users want jobs to "persist" over the course of a cluster head node 
> crash. Would I be right in saying that it's only possible to persist across 
> crashes if the users are using CHECKPOINTING in their jobs? I've heard of it 
> before, just never implemented it and don't know where to start.

You can reboot the head node of an SGE cluster (the one running the qmaster) any
time you like; the jobs aren't affected at all. No checkpointing is necessary
for this functionality. Checkpointing (if supported by the application outside
of SGE) would additionally allow recovery from a crash of an exec host.
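If you do want to go the checkpointing route, SGE only orchestrates it; the
actual checkpoint/restart must come from the application or an external tool.
A hedged sketch of a checkpoint environment (created with `qconf -ackpt`) —
the name, script paths, and directory below are all hypothetical:

```
ckpt_name          app_ckpt
interface          APPLICATION-LEVEL
# hypothetical site-local wrapper scripts:
ckpt_command       /usr/local/sbin/ckpt_save.sh
migr_command       NONE
restart_command    NONE
clean_command      NONE
ckpt_dir           /scratch/ckpt
signal             NONE
when               sx
```

Users would then submit with something like `qsub -ckpt app_ckpt -r y job.sh`
so that the job is marked rerunnable and tied to that checkpoint environment.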

-- Reuti


> Thank you for your time, all.
> 
> JC
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
