Hi,

On 17.01.2012 at 20:41, Jeff Dusenberry wrote:

> I have a subordinate queue set up with notification time of 5 minutes,
> and preempted jobs are terminated (using SIGTERM) after that period.
> For jobs running in that queue, I've been able to confirm that there
> is a 5 minute delay between when the notification is sent and when the
> job is terminated.  The idea is to give the job a chance to save state
> and shut itself down cleanly before being terminated.
> 
> The issue that I've been running into is that the job that triggers
> the preemption begins running when the notification signal is sent.
> We then end up with both jobs running simultaneously during the
> notification period.  Is there any way to delay that second job so it
> will not start until the preempted job has either exited on its own or
> been killed?  Any suggestions for how I might configure this
> differently would be appreciated.

Well, SGE can't look ahead. So I assume you already allow an oversubscription 
of memory and/or slots. And did you define a suspend_method to checkpoint and 
kill the suspended job?

It depends on your setup, but if you have an oversubscription of slots for a 
short time, you could define a "starter_method" which checks `qhost -F slots 
-h foobar` twice a minute or so and waits as long as the value is still above 
the number of cores defined for this machine.
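
A minimal sketch of such a starter method, under several assumptions: the 
intended query is `qhost -F slots -h <host>`, its output contains a line like 
"hc:slots=8.000000", and the names CORES, INTERVAL, query_slots, parse_slots 
and wait_for_slots are made up for illustration. Whether "above" is the right 
comparison depends on how the slots complex is set up on the host, so treat 
the test in wait_for_slots as a placeholder to adapt.

```shell
#!/bin/sh
# Hypothetical starter_method sketch: hold the job until the slot value
# reported for this host is no longer above the core count, then start it.

CORES=${CORES:-8}          # assumption: number of cores on this machine
INTERVAL=${INTERVAL:-30}   # seconds between checks ("twice a minute")

# Query the slots resource for this host.
# Assumption: the name printed by hostname matches the SGE host name.
query_slots() {
    qhost -F slots -h "$(hostname)"
}

# Extract the integer part from a line like "hc:slots=8.000000"
# (assumed output format; adjust the pattern to your installation).
parse_slots() {
    sed -n 's/.*slots=\(-\{0,1\}[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Poll until the reported value is no longer above CORES.
wait_for_slots() {
    while :; do
        slots=$(query_slots | parse_slots)
        # Give up waiting if nothing could be parsed, rather than block forever.
        [ -n "$slots" ] || break
        [ "$slots" -gt "$CORES" ] || break
        sleep "$INTERVAL"
    done
}

# SGE passes the job script and its arguments to the starter method,
# so the job is started by exec'ing them once the wait is over.
if [ "$#" -gt 0 ]; then
    wait_for_slots
    exec "$@"
fi
```

The script would be set as the "starter_method" of the queue that receives 
the preempting jobs. The waiting time counts against the job's own run time, 
so a wallclock limit on that queue needs some headroom.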

BTW: you could also submit the to-be-preempted jobs with a checkpointing 
interface "application_level" and do the checkpointing and the killing of the 
process group in the script defined as "migr_command". Then the preempted job 
goes back to the top of the waiting list, instead of being removed completely 
and having to be submitted again.
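
A rough sketch of what such a "migr_command" script could look like. It 
assumes the checkpointing environment calls it with the job's process ID as 
the first argument (for example via the $job_pid pseudo variable), that this 
PID leads the job's process group, and that the application saves its state 
on SIGUSR1. The names GRACE, signal_group, is_alive and checkpoint_and_kill 
are hypothetical.

```shell
#!/bin/sh
# Hypothetical migr_command script. Assumed to be configured as
#   migr_command  /path/to/this_script $job_pid
# so that $1 is the job's process ID.

GRACE=${GRACE:-300}   # assumption: seconds allowed for saving state

# Send signal $2 to every process in the group led by PID $1.
signal_group() {
    kill -s "$2" -- "-$1"
}

# Succeeds while process $1 still exists.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

# Ask the job to save state, wait up to GRACE seconds, then kill the group
# if the job has not exited on its own.
checkpoint_and_kill() {
    pid=$1
    signal_group "$pid" USR1    # assumption: the job checkpoints on SIGUSR1
    waited=0
    while is_alive "$pid" && [ "$waited" -lt "$GRACE" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    if is_alive "$pid"; then
        signal_group "$pid" KILL
    fi
}

if [ "$#" -gt 0 ]; then
    checkpoint_and_kill "$1"
fi
```

Jobs would then be submitted with `qsub -ckpt <name>`, where <name> is the 
checkpointing environment that refers to this script.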

-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
