Hi,

  I am using an old Slurm version, 20.11.8, and we had to reboot our cluster
today for maintenance. Before the reboot I suspended all the jobs with the
command scontrol suspend list_job_ids, and all of them paused and showed as
suspended. However, when I tried to resume them after the reboot, scontrol
resume did not work: the reason column showed "(JobHeldAdmin)". I was able
to release them with scontrol release, and the jobs started running again.
However, the time recorded by Slurm was reset (the Time column shows 0:00
for all the jobs), though the jobs seem to have restarted from the last
point before they were suspended.
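
For reference, the sequence I followed looks roughly like the sketch below. The job IDs are placeholders, and the run wrapper, the function names and the DRY_RUN switch are only there for illustration. The squeue format string is an assumption about how to display the Time and reason columns, not the exact command I ran.

```shell
#!/usr/bin/env bash
# Sketch of the suspend / reboot / resume / release sequence described above.
# JOB_IDS is a hypothetical comma-separated list, not the real job IDs.
JOB_IDS="${JOB_IDS:-1001,1002,1003}"
# DRY_RUN=1 (the default here) only prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

before_reboot() {
    # Suspend every job in the list before the maintenance reboot.
    run scontrol suspend "$JOB_IDS"
}

after_reboot() {
    # This is the step that did not work for me (jobs showed JobHeldAdmin).
    run scontrol resume "$JOB_IDS"
    # This is what actually got the jobs running again.
    run scontrol release "$JOB_IDS"
    # Job ID, state, elapsed time, time limit and reason, to check the Time column.
    run squeue -j "$JOB_IDS" -o "%.10i %.10T %.10M %.10l %R"
}
```

Calling before_reboot ahead of the reboot and after_reboot once the nodes are back prints the commands; setting DRY_RUN=0 would run them for real.
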
1- Did I follow the right procedure to suspend, reboot and resume/release?
2- In this case, does the wall time for all the jobs get reset, so that
anyone with Slurm admin rights could make their jobs run longer than the
wall time limit by suspending and resuming them?

Best,

*Fritz Ratnasamy*

Data Scientist

Information Technology
