IIUC, when you suspend a job it remains in memory but gets no CPU time
allocated. If you reboot the node, the job state is lost (unless the
job does its own checkpointing). When you restarted the jobs, they
actually began a new run (Slurm doesn't know whether they checkpoint
or not). You've been lucky that your jobs seem to use checkpointing...
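For reference, suspend/resume works via signals; roughly this (job ID
12345 is just a placeholder):

  # scontrol suspend sends SIGSTOP to the job's tasks: they stay
  # resident in memory but get no CPU time
  scontrol suspend 12345
  squeue -j 12345 -o "%i %T"   # state shows SUSPENDED

  # scontrol resume sends SIGCONT, so the tasks continue where they
  # stopped, but only as long as the node has not been rebooted
  scontrol resume 12345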
The procedure we follow when a node reboot is required is to create a
reservation (or drain the nodes), let jobs run until completion or
their time limit, and when the nodes are free we reboot 'em.
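As a rough sketch (the node names, times and reservation name are
made up):

  # Option A: a maintenance reservation keeps new jobs off the nodes
  # during the window, while already-running jobs are left alone
  scontrol create reservation reservationname=maint \
      starttime=2024-11-25T08:00 duration=120 \
      nodes=node[01-10] users=root flags=maint,ignore_jobs

  # Option B: drain the nodes (running jobs finish, nothing new starts)
  scontrol update nodename=node[01-10] state=drain reason="maintenance reboot"

  # once the nodes are idle: reboot, then put them back in service
  scontrol update nodename=node[01-10] state=resume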
Diego
On 22/11/2024 00:05, Ratnasamy, Fritz via slurm-users wrote:
Hi,
I am using an old Slurm version (20.11.8) and we had to reboot our
cluster today for maintenance. I suspended all the jobs on it with the
command scontrol suspend <list_job_ids> and all the jobs paused and
were suspended. However, when I tried to resume them after the reboot,
scontrol resume did not work (the reason column showed
"(JobHeldAdmin)"). I was able to release them with scontrol release
and the jobs started running again. However, the time Slurm recorded
for them reset (the Time column shows 0:00 for all the jobs), though
the jobs seem to have restarted from the last point before they got
suspended.
1- Did I follow the right procedure to suspend, reboot and resume/release?
2- In this case, does the wall time for all the jobs reset, so that
anyone with Slurm admin rights could make their jobs run longer than
the wall-time limit by suspending and resuming them?
Best,
*Fritz Ratnasamy*
Data Scientist
Information Technology
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786