Our cluster has some nodes set aside in their own partition for interactive sessions, which are required to be short and to use only a few nodes. I've always disliked this approach because I see some of the interactive nodes sitting idle while other jobs wait in the batch partition.

I'd proposed that "interactive" jobs ought to just draw from the regular pool of nodes, set up as a QOS or another partition with the following rules:

1. Only a few interactive jobs can run at a given time.
2. A single user can only have one interactive job running or queued.
3. Only a few nodes can be used by an interactive job.
4. The interactive jobs have higher priority than batch jobs.
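To make the four rules concrete, here is a rough Python sketch of them as an admission policy. This is only a toy model, not scheduler configuration: the `Job` class, the function names, and all of the numeric limits are hypothetical placeholders (I only said "a few" above), and a real scheduler would express these as QOS or partition limits.

```python
from dataclasses import dataclass

# Hypothetical limits; the rules only say "a few", so these are placeholders.
MAX_INTERACTIVE_RUNNING = 10             # rule 1: running interactive jobs, cluster-wide
MAX_INTERACTIVE_PER_USER = 1             # rule 2: running or queued, per user
MAX_NODES_PER_INTERACTIVE_JOB = 4        # rule 3: nodes per interactive job
INTERACTIVE_PRIORITY_BOOST = 1_000_000   # rule 4: large enough to outrank batch jobs


@dataclass(frozen=True)
class Job:
    user: str
    nodes: int
    interactive: bool
    base_priority: int = 0


def may_submit(job, running_or_queued):
    """Rules 2 and 3: checks that can be made when the job is submitted."""
    if not job.interactive:
        return True
    if job.nodes > MAX_NODES_PER_INTERACTIVE_JOB:
        return False
    same_user = sum(
        1 for j in running_or_queued if j.interactive and j.user == job.user
    )
    return same_user < MAX_INTERACTIVE_PER_USER


def may_start(job, running):
    """Rule 1: cap on the number of interactive jobs running at once."""
    if not job.interactive:
        return True
    return sum(1 for j in running if j.interactive) < MAX_INTERACTIVE_RUNNING


def effective_priority(job):
    """Rule 4: interactive jobs sort ahead of batch jobs."""
    boost = INTERACTIVE_PRIORITY_BOOST if job.interactive else 0
    return job.base_priority + boost
```

Rule 4 is the one that causes the trouble described next: the boost lets an interactive job take nodes that were being held for a larger batch job.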

Item #4 would give the user a more immediate startup. Not quite as good as running from a separate pool of nodes, but I wouldn't expect the wait times to be long on a big enough cluster.

Here's a problem the admins ran into when they tried this sort of thing:

A. The predictive scheduler knows the maximum time a large job has to wait to gather all the nodes it needs, just by looking at the time limits on all the jobs still running.
B. If a higher-priority job comes in during this "gather" phase, though, it will steal one of the idle nodes that were held for the big job.
C. Given that more nodes now need to be gathered, the predictive scheduler will assign a different maximum wait time to this job, and may start a smaller job instead with the pool of nodes that have been accumulated.
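Steps A through C can be illustrated with a small Python sketch. This is a simplified model of my own, not how any particular scheduler computes its estimate: the function name `worst_case_wait` and all of the numbers are made up, and it assumes every running job runs right up to its time limit.

```python
def worst_case_wait(needed, idle, running):
    """Upper bound on the wait until `needed` nodes are free.

    `running` is a list of (remaining_time_limit_minutes, nodes) pairs.
    Assumes each job holds its nodes until its time limit expires.
    Returns None if the nodes can never be gathered.
    """
    if idle >= needed:
        return 0
    have = idle
    # Nodes come back in order of the earliest-expiring time limit.
    for remaining, nodes in sorted(running):
        have += nodes
        if have >= needed:
            return remaining
    return None


# Step A: a big job needs 8 nodes; 5 are already idle and held for it.
running = [(30, 2), (60, 1), (240, 4)]
before = worst_case_wait(needed=8, idle=5, running=running)

# Step B: an interactive job with a 120-minute limit takes one held node.
running_after = running + [(120, 1)]
after = worst_case_wait(needed=8, idle=4, running=running_after)

# Step C: the bound for the big job has moved.
print(before, after)
```

In this made-up case the bound for the big job moves from 60 to 120 minutes, which is the kind of shift that can prompt the scheduler to start a smaller job on the held nodes instead.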

The result is that the job order can get perturbed quite a bit, and a large job could end up waiting longer than if the interactive jobs drew from a separate pool of nodes. Also, if the scheduler ends up running some smaller job first, not all of the gathered nodes needed to sit idle in the first place, so some node-hours will have gone to waste.

Do any of you know a way to control this?

If the "interactive" jobs were limited to, say, 10 nodes in total, the predictive scheduler could look at the time it would take to gather N+10 nodes instead of N, in which case I think the schedule would behave more deterministically. There'd be a special case if (N+10) is more than the number of nodes on the cluster, of course. And you wouldn't really need to schedule for (N+10) nodes; it would be (N+10-x), where "x" is the number of nodes currently being consumed by interactive jobs.
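The arithmetic I have in mind looks roughly like this in Python. The function and parameter names are hypothetical, and clamping to the cluster size is just one guess at how the special case might be handled; I haven't worked out what the right behavior is there.

```python
def nodes_to_gather(n, interactive_cap, interactive_in_use, cluster_nodes):
    """Number of nodes to plan for when gathering for a job of size n.

    n                  -- nodes the large job actually needs (N)
    interactive_cap    -- total nodes interactive jobs may ever hold (the "10")
    interactive_in_use -- nodes interactive jobs hold right now (x)
    cluster_nodes      -- total nodes on the cluster
    """
    # Only the unused part of the interactive allowance can still be stolen.
    headroom = max(interactive_cap - interactive_in_use, 0)
    # Assumed handling of the special case: never plan for more nodes
    # than the cluster has.
    return min(n + headroom, cluster_nodes)


print(nodes_to_gather(n=100, interactive_cap=10, interactive_in_use=3, cluster_nodes=500))
print(nodes_to_gather(n=495, interactive_cap=10, interactive_in_use=0, cluster_nodes=500))
```

The first call plans for 107 nodes (100 + 10 - 3); the second hits the special case and is clamped to the 500 nodes on the cluster.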

