Thanks Tom,

You are right, it is suspend and not pending that I would like the job state to go into.

I'll take a look into the OverTimeLimit flag and see if it helps.


On 30/10/2020 14:10, Thomas M. Payerle wrote:


On Fri, Oct 30, 2020 at 5:37 AM Loris Bennett <loris.benn...@fu-berlin.de> wrote:

    Hi Zacarias,

    Zacarias Benta <zacar...@lip.pt> writes:

    > Good morning everyone.
    >
    > I'm having an "issue"; I don't know if it is a "bug or a feature".
    > I've created a QOS: "sacctmgr add qos myqos set GrpTRESMins=cpu=10
    > flags=NoDecay".  I know the limit is too low, but I just wanted to
    > give you guys an example.  Whenever a user submits a job and uses
    > this QOS, if the job reaches the limit I've defined, the job is
    > canceled and I lose all the computation it had done so far.  Is it
    > possible to create a QOS/Slurm setting so that when a user reaches
    > the limit, the job state changes to pending?  This way I could
    > increase the limits and change the job state back to running so it
    > can continue until it reaches completion.  I know this is a little
    > bit odd, but I have users that have requested CPU time as per an
    > agreement between our HPC center and their institutions.  I know
    > limits are set so they can be enforced; what I'm trying to prevent
    > is, for example, a person having a job running for two months and
    > at the end not having any data because they just needed a few more
    > days.  This could be prevented if I could grant them a couple more
    > days of CPU, if the job went on to a pending state after reaching
    > the limit.

Your "pending" suggestion does not really make sense.  A pending job is no longer attached to a node, it is in the queue.  It sounds like you are trying to "suspend" the job, e.g. ctrl-Z it in most shells, so that it is no longer using CPU.  But even that would have it consuming RAM, which on many clusters would be a serious problem.

Slurm supports a "grace period" for walltime, the OverTimeLimit parameter.  I have not used it, but it might be what you want.  From the web docs: "OverTimeLimit - Amount by which a job can exceed its time limit before it is killed.  A system-wide configuration parameter."  I believe that if a job has a 1-day time limit and OverTimeLimit is 1 hour, the job effectively gets 25 hours before it is terminated.
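If that is the route you take, it is a single slurm.conf setting; a minimal sketch, with 60 minutes as a purely illustrative value:

    # slurm.conf: let jobs run up to 60 minutes past their time limit before being killed
    OverTimeLimit=60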

You also should look into getting your users to checkpoint jobs (as hard as educating users is).  I.e., jobs, especially large or long-running jobs, should periodically save their state to a file.  That way, if a job is terminated before it is complete for any reason (from time limits to failed hardware to power outages, etc.), it should be able to resume from the last checkpoint.  So if a job checkpoints every 6 hours, it should not lose more than about 6 hours of runtime should it terminate prematurely.  This is sort of the "pending" solution you referred to; the job dies, but can be restarted/requeued with additional time and more or less pick up from where it left off.  Some applications support checkpointing natively, and there are libraries/packages like dmtcp which can do more system-level checkpointing.
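A rough sketch of what that pattern looks like in a batch script; my_app, state.chk and the --restart flag are placeholders for whatever checkpoint/restart mechanism your application actually provides:

    #!/bin/bash
    #SBATCH --time=2-00:00:00
    #SBATCH --requeue

    # Resume from the last checkpoint if one exists, otherwise start fresh;
    # the application is assumed to write state.chk every few hours.
    if [ -f state.chk ]; then
        ./my_app --restart state.chk
    else
        ./my_app
    fi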


    I'm not sure there is a solution to your problem.  You want to both
    limit the time jobs can run and also not limit it.  How much more
    time do you want to give a job which has reached its limit?  A fixed
    time?  A percentage of the time used up to now?  What happens if two
    months plus a few more days is not enough and the job needs a few
    more days?

    The longer you allow jobs to run, the more CPU is lost when jobs fail
    to complete, and the sadder users then are.  In addition, the longer
    jobs run, the more likely they are to fall victim to hardware failure
    and the less able you are to perform administrative tasks which
    require a down-time.  We run a university cluster with an upper
    time-limit of 14 days, which I consider fairly long, and occasionally
    extend individual jobs on a case-by-case basis.  For our users this
    seems to work fine.
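    As an aside, extending an individual job like that is usually a
    one-liner of the following form (12345 is a placeholder job ID, and
    this sets a new absolute limit rather than adding to the old one):

        scontrol update JobId=12345 TimeLimit=16-00:00:00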

    If your jobs need months, you are in general using the wrong software
    or using the software wrong.  There may be exceptions to this, but in
    my experience they are few and far between.

    So my advice would be to try to convince your users that shorter
    run-times are in fact better for them and only by happy accident also
    better for you.

    Just my 2¢.

    Cheers,

    Loris

    >
    > Cumprimentos / Best Regards,
    >
    > Zacarias Benta
    > INCD @ LIP - Universidade do Minho
    >
    --
    Dr. Loris Bennett (Mr.)
    ZEDAT, Freie Universität Berlin
    Email loris.benn...@fu-berlin.de



--
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads paye...@umd.edu
5825 University Research Park               (301) 405-6135
University of Maryland
College Park, MD 20740-3831
--

Cumprimentos / Best Regards,

Zacarias Benta
INCD @ LIP - Universidade do Minho

