Hi,

> On 23.10.2018 at 20:31, Dj Merrill <s...@deej.net> wrote:
> 
> Hi Reuti,
>       Thank you for your response.  I didn't describe our environment very
> well, and I apologize.  We only have one queue.  We've had a few
> instances of people forgetting they ran a job that doesn't apparently
> have any stopping conditions, and am trying to come up with a way to
> gently remind folks when they've left something running.
> 
>       Current thoughts are to have the "sge_request" file contain:
> -soft -l s_rt=720:0:0
> 
>       We can tell them to use qalter to extend the time if they want, or they
> can contact us to do it.

This won't work in SGE. The limits are set when the job starts. The only way to 
extend a runtime limit is to softstop the execd on the particular node (with 
the side effect that no more jobs will be scheduled there until it is 
restarted), and to restart the execd once the job that was granted to run 
longer than estimated has come to an end.
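
For completeness, a rough sketch of this workaround (node42 is only a 
placeholder hostname, and the path of the execd startup script depends on the 
installation):

# as an SGE manager: shut down the execd on node42, its running jobs keep running
qconf -ke node42
# ... wait until the long-running job has finished, then restart the execd
# on node42 itself, e.g. via its startup script:
$SGE_ROOT/$SGE_CELL/common/sgeexecd start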


>       It would be nice if we could somehow parse the current s_rt on a job,
> and 5 days before that time send out an email notification.  If they
> extend it to longer, we'd like it to again send out the notification 5
> days before the new limit.  In other words, something along the lines of
> running a cron script every night that parses the running jobs, gets the
> relevant info, and sends out an email notification if necessary.
> 
>       In fact, we might not even need the s_rt limit set at all and an email
> reminder at set intervals might be enough for our purposes, although
> being able to have it auto terminate the job would save some manual effort.

I would suggest storing such arbitrary information in a job context, e.g. 
"qsub -ac ESTIMATED_RUNTIME=720". Reading your complete description of the 
setup, I get the impression that we are speaking here of jobs running for days 
or weeks. Hence a cron job on the master node of the cluster could do 
everything once per hour or every 10 minutes:

- read the job context and grep for the currently set maximum duration
- generate an email when a certain limit is passed, and store the information 
that the email was already sent in the job context too*
- kill any job that has passed its limit

*) This additional context variable "WARNED_FOR=…" could simply get the same 
value as the limit that was just passed. As long as "ESTIMATED_RUNTIME" equals 
"WARNED_FOR", no additional email is generated. But if the user changes the 
"ESTIMATED_RUNTIME", we can detect this, and an email can be sent again when 
the adjusted "ESTIMATED_RUNTIME" is about to be reached. It might be easier to 
have a wrapper that converts hh:mm:ss to plain seconds, or even to advise the 
users to specify the limit in minutes or hours only as a general requirement, 
so that no further conversion is necessary in the script.
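
To illustrate the context handling (the job script name, the job ID 1234 and 
the values are only placeholders):

qsub -ac ESTIMATED_RUNTIME=720 job.sh    # user states the expected runtime in hours
qalter -ac ESTIMATED_RUNTIME=900 1234    # user extends the estimate later on
qstat -j 1234 | grep "^context"          # shows e.g. "context: ESTIMATED_RUNTIME=900"
qalter -ac WARNED_FOR=900 1234           # the cron script notes that a warning was sent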

I wonder how we can pull all the information in one `qstat` call. The context 
variables of the running jobs you get with `qstat -s r -j "*"`, but the actual 
start time of a job is only output by a plain `qstat -s r` or `qstat -s r -r`. 
To lower the impact on the qmaster, we should avoid looping over all currently 
running jobs one after the other.
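
A rough sketch of such a cron script, under the assumptions made above (limits 
stated in plain hours, GNU date on the master node, the default `qstat` column 
layout, and only two `qstat` calls per run):

#!/bin/bash
# Sketch only: warn before, and kill after, a job's ESTIMATED_RUNTIME is reached.
NOW=$(date +%s)
WARN_HOURS=120                       # start warning 5 days before the limit

# One qstat call for the contexts of all running jobs, one for owner and start time.
qstat -s r -j "*" | awk '/^job_number:/ {job=$2} /^context:/ {print job, $2}' > /tmp/ctx.$$
qstat -s r | awk 'NR>2 {print $1, $4, $6, $7}' > /tmp/run.$$

while read JOBID OWNER DAY TIME; do
    CTX=$(awk -v j="$JOBID" '$1 == j {print $2}' /tmp/ctx.$$)
    EST=$(echo "$CTX" | tr ',' '\n' | awk -F= '$1 == "ESTIMATED_RUNTIME" {print $2}')
    WARNED=$(echo "$CTX" | tr ',' '\n' | awk -F= '$1 == "WARNED_FOR" {print $2}')
    [ -z "$EST" ] && continue                          # no estimate set for this job

    START=$(date -d "$DAY $TIME" +%s)
    RUNHOURS=$(( (NOW - START) / 3600 ))

    if [ "$RUNHOURS" -ge "$EST" ]; then
        qdel "$JOBID"                                  # limit passed: kill the job
    elif [ "$RUNHOURS" -ge $(( EST - WARN_HOURS )) ] && [ "$WARNED" != "$EST" ]; then
        echo "Job $JOBID will reach its limit of $EST hours within $WARN_HOURS hours." | \
            mail -s "Runtime warning for job $JOBID" "$OWNER"
        qalter -ac WARNED_FOR="$EST" "$JOBID"          # remember we warned for this limit
    fi
done < /tmp/run.$$
rm -f /tmp/ctx.$$ /tmp/run.$$

The per-job lookups only read the two temporary files, so the qmaster is 
queried exactly twice per run.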

-- Reuti

> 
>       What I'm asking for might not even be practical, but I thought it worth
> a try to ask.
> 
> Thanks,
> 
> -Dj
> 
> 
> 
> On 10/20/2018 05:02 AM, Reuti wrote:
>> Hi,
>> 
>> On 19.10.2018 at 22:44, Dj Merrill wrote:
>> 
>>> Hi all,
>>>     Assuming a soft run time limit for a queue, is there a way to send an
>>> email warning when the job is about to hit the limit?
>>> 
>>>     For example, for a job with "-soft -l s_rt=720:0:0" giving a 30 day run
>> 
>> You are aware that this is a soft-soft limit, meaning: I prefer a queue with 
>> an s_rt of 720:0:0, and if I get only 360:0:0 that's also fine.
>> 
>> 
>>> time, is there a way to send an email at the 25 day mark to let the
>>> person know the job will be forced to end in 5 days?
>> 
>> The s_rt limit already serves the purpose of sending a signal (SIGUSR1) before 
>> h_rt is reached. Please have a look at "RESOURCE LIMITS" in `man 
>> queue_conf`. So I wonder whether the combined usage of s_rt and h_rt (both 
>> with the default -hard option) could already provide what you want to 
>> implement.
>> 
>> Sure, the SIGUSR1 must be caught in the script and masked out in the called 
>> binary to avoid it being killed by the SIGUSR1 default behavior. I use a 
>> subshell for it:
>> 
>> trap "echo Foo" SIGUSR1
>> (trap - SIGUSR1; my_binary)
>> 
>> as the SIGUSR1 is sent to the complete process tree of the job. The "echo 
>> Foo" could be replaced by `mail -s Warning …`.
>> 
>> 
>>>     I've thought about trying to draft a script to do this, but thought I'd
>>> ask first if anyone else has come up with something.
>> 
>> A completely different approach: use a checkpoint interface to send an email 
>> warning. The interval given to `qsub -c 600:0:0 -ckpt mailer_only …` 
>> represents the 25 days, and the checkpointing interface "mailer_only" does 
>> not do any real checkpointing, but has a script defined for "ckpt_command" 
>> which sends an email (i.e. "interface application-level" must be used).
>> 
>> There is an introduction to use the checkpoint interface here: 
>> https://arc.liv.ac.uk/SGE/howto/checkpointing.html
>> 
>> -- Reuti
>> 
> 


_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
