[slurm-users] Priority jobs interfering with predictive scheduling

Carl Ponder Wed, 12 Apr 2023 15:54:27 -0700

Our cluster has some nodes separated to their own partition for running interactive sessions, which are required to be short and only use a few nodes. I've always disliked this approach because I see some of the interactive nodes being idle while other jobs are waiting on the batch partition.

I'd proposed that the "interactive" jobs ought to just draw from the regular pool of nodes, parameterized as a QOS or another partition, as follows:


1. Only a few interactive jobs can run at a given time.
2. A single user can only have one interactive job running or queued.
3. Only a few nodes can be used by an interactive job.
4. The interactive jobs have higher priority than batch jobs.

Item #4 would give the user a more immediate startup. Not quite as good as running from a separate pool of nodes, but I wouldn't expect the wait-times to be long on a big enough cluster.

Here's a problem the Admins ran into when they tried this sort of thing:

A. The predictive scheduler knows the maximum time a large job has to wait to gather all the nodes it needs, just by looking at the time-limits on all the jobs still running.
B. If a higher-priority job comes in during this "gather" phase, though, it will steal one of the idle nodes that were held for the big job.
C. Given that more nodes now need to be gathered, the predictive scheduler will assign a different maximum wait-time to this job, and may start a smaller job instead with the pool of nodes that have been accumulated.
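
To make A through C concrete, here is a minimal Python sketch of the worst-case wait estimate. It is only an illustration, not Slurm's actual backfill code: the function name, the (nodes, minutes_left) tuples and the numbers are all made up, and it assumes every running job runs to its full time-limit.

```python
def worst_case_wait(needed, idle, running):
    """Latest time (minutes from now) by which `needed` nodes are free.

    running: list of (nodes, minutes_left_on_time_limit), one per running job.
    Assumes each job holds its nodes until its time-limit expires.
    Returns 0 if enough nodes are already idle, None if the listed jobs
    can never free enough nodes.
    """
    free = idle
    if free >= needed:
        return 0
    # Walk the running jobs in order of when their time-limits expire.
    for nodes, minutes_left in sorted(running, key=lambda job: job[1]):
        free += nodes
        if free >= needed:
            return minutes_left
    return None


# Hypothetical state: big job needs 9 nodes, 4 are idle and held for it.
running = [(2, 30), (3, 60), (4, 120)]
before = worst_case_wait(9, idle=4, running=running)

# An interactive job with a 90-minute limit takes one of the held nodes.
after = worst_case_wait(9, idle=3, running=running + [(1, 90)])

print(before, after)
```

With these invented numbers the estimate for the big job moves from 60 to 90 minutes as soon as the interactive job takes a held node, which is the kind of shift described in C.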

The result is that the job-order can get perturbed quite a bit and a large job could end up waiting longer than if the interactive jobs drew from a separate pool of nodes. Also, if it ends up running some smaller job first, not all of the gathered nodes would have needed to sit idle to begin with, and some node-hours will have gone to waste.

Do any of you know a way to control this?

If the "interactive" jobs were limited to, say, 10 total, the predictive scheduler could look at the time it would take to gather N+10 nodes instead of N, in which case I think the schedule would behave more deterministically. There'd be a special case if (N+10) is more than the number of nodes on the cluster, of course. And you wouldn't really need to schedule for (N+10) nodes, it would be (N+10-x) where "x" is the number of nodes currently being consumed by interactive jobs.
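
The arithmetic can be sketched in a few lines of Python. This is only an illustration of the idea, not an existing Slurm option: the function and parameter names are hypothetical, the cap is treated as a node count, and clamping to the cluster size is just one assumed way to handle the special case.

```python
def padded_node_target(n, interactive_cap, interactive_in_use, cluster_nodes):
    """Number of nodes to plan for when gathering for an N-node job.

    n:                   nodes the large job actually needs (N)
    interactive_cap:     most nodes interactive jobs may ever hold (10 above)
    interactive_in_use:  nodes interactive jobs hold right now (x)
    cluster_nodes:       total nodes on the cluster
    """
    if not 0 <= interactive_in_use <= interactive_cap:
        raise ValueError("interactive_in_use must be between 0 and the cap")
    # N + 10 - x: pad only for interactive nodes that could still be taken.
    target = n + interactive_cap - interactive_in_use
    # Assumed handling of the special case: never plan for more nodes
    # than the cluster has.
    return min(target, cluster_nodes)


print(padded_node_target(50, 10, 3, 100))
print(padded_node_target(95, 10, 0, 100))
```

For a 50-node job with 3 interactive nodes already in use this plans for 57 nodes; for a 95-node job on a 100-node cluster the target is clamped to 100.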
