After coming out of maintenance, I have a large number of nodes with the
"maint" flag still set after deleting the maintenance reservation. I have
attempted to clear it using scontrol in a variety of ways, but to no avail. Has
anyone seen this? Does anyone have a solution short of mass node reboots?
Thanks.
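For reference, the sort of thing I have been trying looks like this (the node
name is just an example):
scontrol update nodename=node001 state=resume
scontrol show node node001 | grep -i state   # maint flag still shows up here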
I just spent another fun hour digging out why I got the classic:
Job violates accounting/QOS policy (job submit limit, user's size and/or time
limits)
I dug it out with the tried and true sacctmgr show associations, scontrol
show partition, etc. -- but are there better tools to get at the specific
limit that is being violated?
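In case it helps anyone else, these are roughly the queries I pieced it
together from (the user name is a placeholder, and some of these format
field names may differ between Slurm versions):
sacctmgr show assoc where user=alice format=account,user,partition,maxjobs,maxsubmit,maxwall,qos
sacctmgr show qos format=name,maxwall,maxjobspu,maxsubmitpu,maxtrespu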
I have a number of nodes that have, after our transition to CentOS 7.3/SLURM
17.02, begun to occasionally display a status of "Not responding". The health
check we run on each node every 5 minutes detects nothing, and the nodes are
perfectly healthy once I set their state to "idle". The slurmd c
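For the record, this is how I have been checking and recovering them (the
node name is just an example):
scontrol show node node001                    # check State, Reason, SlurmdStartTime
scontrol update nodename=node001 state=idle   # this alone brings the node back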
We use something like this:
scontrol create reservation starttime=2017-11-08T06:00:00 duration=1440
user=root flags=maint,ignore_jobs nodes=ALL
Reservation created: root_2
Then confirm:
scontrol show reservation
ReservationName=root_2 StartTime=2017-11-08T06:00:00
EndTime=2017-11-09T06:00:00
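Once the maintenance is done, we delete the reservation the same way, which
is what normally drops the maint flag from the nodes (the sinfo check is
just how we verify it cleared):
scontrol delete reservationname=root_2
sinfo -t maint -o "%N %T"   # should come back empty once the flag is gone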