Hi,
I'd like to allow job suspension in my cluster without the "penalty" of RAM utilization. The jobs are sometimes very big and can require ~100GB of memory per node. Suspending such a job usually means almost nothing else can run on the same node, except very small memory jobs. Currently the solution is requeue preemption, with or without checkpointing. I never want running jobs to use swap - I'd rather get OOM-killed than have a job use swap while it is running.
Is there a way to tell Slurm to allocate swap and use it only for suspending, to allow preemption without terminating the jobs?
The nodes have on the order of a TB of local disk each, and most jobs never use any of it (relying on shared storage instead), so local disk space is usually not a concern.
Using swap to store suspended jobs, while slow to freeze and thaw, seems to me a better localized solution than checkpoint-and-requeue: the job can resume "immediately" (minus disk I/O time) once the high-priority job finishes. But if I'm mistaken, please enlighten me.
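For reference, the suspend-style preemption I have in mind would be configured roughly as below. This is only a sketch - the partition names and priority values are made up, and I haven't tested this combination:

    # slurm.conf (sketch)
    PreemptType=preempt/partition_prio
    PreemptMode=SUSPEND,GANG
    PartitionName=low  Nodes=node[01-10] PriorityTier=1  Default=YES
    PartitionName=high Nodes=node[01-10] PriorityTier=10

    # cgroup.conf (sketch)
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    AllowedSwapSpace=0   # no extra swap allowance for jobs

As I understand it, this gives suspend/resume preemption via gang scheduling, but says nothing about moving a suspended job's pages out to swap - which is exactly the gap I'm asking about.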
I was wondering whether simply configuring a large swap device in Linux while setting AllowedSwapSpace=0 in cgroup.conf would work, but I suspect two problems:
1. Even while suspended, the job still remains under its cgroup limits.
2. Which process gets swapped out is non-deterministic from my point of view - I'm not sure the kernel will swap out the suspended job rather than the new job, at least in its early stages.

Thanks in advance,
--Dani_L.