[slurm-users] Clearing the "maint" flag

2019-07-16 Thread Stradling, Alden Reid (ars9ac)
After coming out of maintenance, I have a large number of nodes with the "maint" flag still set after deleting the maintenance reservation. I have attempted to clear it using scontrol in a variety of ways, but to no avail. Has anyone seen this? Does anyone have a solution short of mass node reboots? Tha
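The message does not say which scontrol invocations were attempted; a minimal sketch of the usual candidates, with node001 and root_1 as placeholder node and reservation names:

    # Check whether the MAINT flag is still present in the node state
    scontrol show node node001 | grep -i state

    # Ask slurmctld to return the node to service
    scontrol update nodename=node001 state=resume

    # Make sure no stale maintenance reservation is still covering the node
    scontrol show reservation
    scontrol delete reservation=root_1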

[slurm-users] Debugging accounting/QOS policy errors

2018-06-22 Thread Stradling, Alden Reid (ars9ac)
I just spent another fun hour diffing out why I got the classic: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits). I dug it out with the tried-and-true sacctmgr show associations, scontrol show partition, etc. -- but are there better tools to get at the
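For reference, a sketch of the kind of queries being described, with jdoe and standard as placeholder user and partition names:

    # Per-association limits for one user
    sacctmgr show associations where user=jdoe format=cluster,account,user,partition,maxjobs,maxsubmit,maxwall,grptres

    # Limits attached to each QOS
    sacctmgr show qos format=name,maxwall,maxjobspu,maxsubmitpu,maxtrespu

    # Partition-level limits
    scontrol show partition standard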

[slurm-users] Intermittent "Not responding" status

2017-12-04 Thread Stradling, Alden Reid (ars9ac)
I have a number of nodes that have, after our transition to CentOS 7.3/Slurm 17.02, begun to occasionally display a status of "Not responding". The health check we run on each node every 5 minutes detects nothing, and the nodes are perfectly healthy once I set their state to "idle". The slurmd c
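Not part of the original message, but a sketch of how such a node is typically inspected and returned to service, with node001 as a placeholder name:

    # See what state and reason slurmctld recorded, and confirm the controller is reachable
    scontrol show node node001 | grep -iE 'state|reason'
    scontrol ping

    # Check that slurmd is actually running on the node
    ssh node001 systemctl status slurmd

    # Return the node to service once it looks healthy
    scontrol update nodename=node001 state=resume

The "Not responding" flag is set when slurmd stops answering the controller's pings; SlurmdTimeout in slurm.conf governs how long that can go on before the node is marked DOWN, so that value is also worth checking.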

Re: [slurm-users] Quick hold on all partitions, all jobs

2017-11-08 Thread Stradling, Alden Reid (ars9ac)
We use something like this:

    scontrol create reservation starttime=2017-11-08T06:00:00 duration=1440 user=root flags=maint,ignore_jobs nodes=ALL
    Reservation created: root_2

Then confirm:

    scontrol show reservation
    ReservationName=root_2 StartTime=2017-11-08T06:00:00 EndTime=2017-11-09T06:00:0
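Not quoted above, but the matching teardown step once the outage window ends is to delete the reservation by name, which releases the hold:

    scontrol delete reservation=root_2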