[slurm-users] Re: Suspending jobs and resuming

2024-11-21 Thread Diego Zuccato via slurm-users
IIUC, when you suspend a job it remains in memory but with no CPU time allocated. If you reboot the node, the job state is lost (unless it uses checkpointing). When you restarted the jobs, they actually began a new run (Slurm doesn't know if they use checkpointing or not). You've been lucky tha
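
The point about suspended jobs living only in memory can be illustrated outside Slurm. The sketch below assumes the behaviour described in the scontrol documentation, where suspend and resume rely on the job's processes honouring SIGSTOP and SIGCONT. It is not Slurm code: the function name suspend_resume_demo and the sleeping child process are hypothetical stand-ins for a job step, and it runs on Unix-like systems only.

```python
import os
import signal
import subprocess
import sys


def suspend_resume_demo():
    """Pause and continue a child process with SIGSTOP/SIGCONT (Unix only)."""
    # Hypothetical stand-in for a job step: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        # "Suspend": the process stops running but stays resident in memory.
        os.kill(child.pid, signal.SIGSTOP)
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)

        # "Resume": the very same process continues from where it was.
        os.kill(child.pid, signal.SIGCONT)
        _, status = os.waitpid(child.pid, os.WCONTINUED)
        continued = os.WIFCONTINUED(status)
    finally:
        # Stand-in for a node reboot: the process is killed, and whatever
        # state it held in memory is gone with it.
        child.kill()
        returncode = child.wait()
    return {"stopped": stopped, "continued": continued, "returncode": returncode}


if __name__ == "__main__":
    print(suspend_resume_demo())
```

Nothing in this sketch writes the child's state to disk, which is the situation of a suspended job without checkpointing: once the process is gone, there is nothing left to continue, and the work can only start over.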

[slurm-users] Suspending jobs and resuming

2024-11-21 Thread Ratnasamy, Fritz via slurm-users
Hi, I am using an old Slurm version (20.11.8), and we had to reboot our cluster today for maintenance. I suspended all the jobs on it with the command scontrol suspend list_job_ids, and all of them were suspended. However, when I tried to resume them after the reboot, scontrol resume did