This conversation is drifting a bit away from my initial questions and covering various other related topics. In fact, I agree with almost everything written in the last few messages; however, it is somewhat orthogonal to my initial request, which I now understand has the answer "not possible with slurm configuration; possible with ugly hacks that are probably error-prone and not worth the hassle". Just for the sake of the discussion (since I'm enjoying hearing the various perspectives) I'll restate my request and why I think slurm does not support this need.
Most clusters have very high utilization all the time. This is good for ROI etc. but annoying to users. Forcing users to specify a firm wallclock limit helps slurm make good scheduling decisions, which keeps utilization (ROI, etc.) high and minimizes wait time for everybody.

At the place where I work the situation is quite different: there are moments of high pressure and long waits, and there are moments in which utilization drops under 50% and sometimes even under 25% (e.g. during long weekends). We can have a discussion about it, but the bottom line is that management (ROI, etc.) is fine with it, so that's the way it is. This circumstance, I agree, is quite peculiar and not shared by any other place where I have worked or ever had an account, but it is what it is.

In this circumstance it feels at least silly, and perhaps even extremely wasteful and annoying, to let slurm cancel jobs at their wallclock limit without considering other context. Imagine a user with a week-long job who estimated a 7-day wallclock limit and "for good measure" requested 8 days, but whose job would actually take 9 days. Imagine that the 8th day fell in the middle of a long weekend when utilization was 25% and there was not a single other job pending. Maybe this job is a one-off experiment quickly cobbled together to test one thing, so it's not a well-designed piece of code and does not have checkpoint-restart capabilities. Why enforce the wallclock limit in that situation?

The way around this problem in the past was simply not to make the wallclock limit mandatory (a decision made by my predecessor, who has since left). That worked, but only because the cluster was not in a very good usability state, so most people avoided it anyway; there was seldom a long line of jobs pending in the queue, and slurm did not need to work very hard to schedule things. Now that I've improved the usability situation, this has become a problem, because utilization has become much higher. Perhaps in a short time people will learn to plan ahead, submit more jobs and fill the machine up during the weekends too (I'm working on user education towards that), and if that happens, the above dilemma will go away. But for now I have it.

I'm still mulling over how best to proceed. Maybe just force the users to set a wallclock limit and live with it. Here is another idea that just came to me: does slurm have a "global" switch to turn cancelling jobs at their wallclock limit on and off? If so, I could have a cron job checking whether there are pending jobs in the queue, shutting enforcement off if there are none and turning it back on if there are. Granted, that may be sloppy (e.g. one job pending for one resource causing the cancellation of jobs using other resources), but it's something, and it would be easy to implement compared to toggling pre-emption on and off as discussed in a previous message. (Rough, untested sketches of these ideas are in the postscripts at the bottom of this message.)

Great conversation folks, I'm enjoying reading the various perspectives at different sites!

On Tue, Jun 17, 2025 at 12:26 AM Loris Bennett via slurm-users <slurm-users@lists.schedmd.com> wrote:

> Hi Prentice,
>
> Prentice Bisbal via slurm-users <slurm-users@lists.schedmd.com> writes:
>
> > I think the idea of having a generous default timelimit is the wrong way to go. In fact, I think any defaults for jobs are a bad way to go. The majority of your users will just use that default time limit, and backfill scheduling will remain useless to you.
>
> Horses for courses, I would say.
> We have a default time of 14 days, but because we also have QoS with increased priority but shorter time limits, there is still an incentive for users to set the time limit themselves. So currently we have around 900 jobs running, only 100 of which are using the default time limit. Many of these will be long-running Gaussian jobs and will indeed need the time.
>
> > Instead, I recommend you use your job_submit.lua to reject all jobs that don't have a wallclock time and print out a helpful error message to inform users they now need to specify a wallclock time, and provide a link to documentation on how to do that.
> >
> > Requiring users to specify a time limit themselves does two things:
> >
> > 1. It reminds them that it's important to be conscious of timelimits when submitting jobs
>
> This is a good point. We use 'jobstats', which provides information after a job has completed about run time relative to time limit, amongst other things, although unfortunately many people don't seem to read this. However, even if you do force people to set a time limit, they can still choose not to think about it and just set the maximum.
>
> > 2. If a job is killed before it's done and all the progress is lost because the job wasn't checkpointing, they can't blame you as the admin.
>
> I don't really understand this point. The limit is just the way it is, just as we have caps on the total number of cores or GPUs a given user's jobs can use at any one time. Up to now no-one has blamed us for this.
>
> > If you do this, it's easy to get the users on board by first providing useful and usable documentation on why timelimits are needed and how to set them. Be sure to hammer home the point that effective timelimits can lead to their jobs running sooner, and that effective timelimits can increase cluster efficiency/utilization, helping them get a better return on their investment (if they contribute to the cluster's cost) or they'll get more science done. I like to frame it that accurate wallclock times will give them a competitive edge in getting their jobs running before other cluster users. Everyone likes to think what they're doing will give them an advantage!
>
> I agree with all this, and it is also what we try to do. The only thing I don't concur with is your last sentence. In my experience, as long as things work, users will in general not give a fig about whether they are using resources efficiently. Only when people notice a delay in jobs starting do they become more aware of it and are prepared to take action. It is particularly a problem with new users, because fairshare means that their jobs will start pretty quickly, no matter how inefficiently they have configured them. Maybe we should just give new users fewer shares initially and only later bump them up to some standard value.
>
> Cheers,
>
> Loris
>
> > My 4 cents (adjusted for inflation).
> >
> > Prentice
> >
> > On 6/12/25 9:11 PM, Davide DelVento via slurm-users wrote:
> >
> > Sounds good, thanks for confirming it.
> > Let me sleep on it w.r.t. the "too many QOS" issue, or think about whether I should ditch this idea.
> > If I implement it, I'll post the details of how I did it in this conversation.
> > Cheers
> >
> > On Thu, Jun 12, 2025 at 6:59 AM Ansgar Esztermann-Kirchner <aesz...@mpinat.mpg.de> wrote:
> >
> > On Thu, Jun 12, 2025 at 04:52:24AM -0600, Davide DelVento wrote:
> > > Hi Ansgar,
> > >
> > > This is indeed what I was looking for: I was not aware of PreemptExemptTime.
> > >
> > > From my cursory glance at the documentation, it seems that PreemptExemptTime is QOS-based and not job-based though. Is that correct? Or could it be set per-job, perhaps in a prolog/submit lua script?
> >
> > Yes, that's correct.
> > I guess you could create a bunch of QOS with different PreemptExemptTimes and then let the user select one (or indeed select it from lua), but as far as I know, there is no way to set arbitrary per-job values.
> >
> > Best,
> >
> > A.
> > --
> > Ansgar Esztermann
> > Sysadmin Dep. Theoretical and Computational Biophysics
> > https://www.mpinat.mpg.de/person/11315/3883774
> >
> > --
> > Prentice Bisbal
> > HPC Systems Engineer III
> > Computational & Information Systems Laboratory (CISL)
> > NSF National Center for Atmospheric Research (NSF NCAR)
> > https://www.cisl.ucar.edu
> > https://ncar.ucar.edu
>
> --
> Dr. Loris Bennett (Herr/Mr)
> FUB-IT, Freie Universität Berlin
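P.S. Since I mentioned possibly just forcing a wallclock limit: a minimal, untested sketch of Prentice's job_submit.lua suggestion could look roughly like the following. The field and constant names (job_desc.time_limit, slurm.NO_VAL, slurm.log_user) should be double-checked against the job_submit.lua examples shipped with your Slurm version before use.

    -- job_submit.lua sketch: reject jobs submitted without an explicit time limit.
    -- Untested; verify field/constant names against your Slurm version's examples.
    function slurm_job_submit(job_desc, part_list, submit_uid)
       -- an unset --time normally shows up as NO_VAL in the submit descriptor
       if job_desc.time_limit == nil or job_desc.time_limit == slurm.NO_VAL then
          slurm.log_user("Please specify a wallclock limit with --time " ..
                         "(see our documentation page for examples)")
          return slurm.ERROR
       end
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end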
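P.S. And here is, very roughly, the cron-driven "global switch" I was musing about above. It assumes that OverTimeLimit in slurm.conf (which, if I remember the man page correctly, accepts UNLIMITED) is the right knob and that slurmctld actually honours a change to it on 'scontrol reconfigure' -- both assumptions I would need to verify before trusting this. Sketch only, written in Lua for consistency with the job_submit one:

    -- Cron-driven sketch: relax wallclock enforcement when nothing is pending,
    -- re-enable it otherwise. Assumes OverTimeLimit in slurm.conf is the right
    -- knob and that "scontrol reconfigure" makes the change effective (verify!).

    local SLURM_CONF = "/etc/slurm/slurm.conf"   -- adjust to your installation

    local function pending_jobs()
       local p = io.popen("squeue --states=PENDING --noheader | wc -l")
       local n = tonumber(p:read("*a"))
       p:close()
       return n or 0
    end

    local function set_overtimelimit(value)
       -- crude in-place edit of an existing OverTimeLimit line;
       -- a real version should be far more careful
       os.execute(string.format(
          "sed -i 's/^OverTimeLimit=.*/OverTimeLimit=%s/' %s && scontrol reconfigure",
          value, SLURM_CONF))
    end

    if pending_jobs() == 0 then
       set_overtimelimit("UNLIMITED")  -- empty queue: let running jobs overrun
    else
       set_overtimelimit("0")          -- jobs waiting: enforce limits again
    end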
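P.S. On the "too many QOS" / PreemptExemptTime idea from earlier in the thread: if I do go that route, the selection could probably be done from job_submit.lua rather than being left to the users. A hypothetical fragment follows; the QOS names are invented for illustration and would need to be created beforehand with sacctmgr, each with its own PreemptExemptTime.

    -- Hypothetical helper for job_submit.lua: pick one of a few pre-created QOSes
    -- (names invented here) whose only difference is PreemptExemptTime, based on
    -- the requested wallclock time. time_limit is in minutes in the submit descriptor.
    local function pick_preempt_exempt_qos(job_desc)
       if job_desc.time_limit == nil or job_desc.time_limit == slurm.NO_VAL then
          return  -- no explicit limit requested; leave the QOS alone
       end
       if job_desc.time_limit <= 60 then
          job_desc.qos = "exempt_1h"       -- invented QOS names; create with sacctmgr
       elseif job_desc.time_limit <= 1440 then
          job_desc.qos = "exempt_1d"
       else
          job_desc.qos = "exempt_7d"
       end
    end
    -- ...and call pick_preempt_exempt_qos(job_desc) from slurm_job_submit().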
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com