Hello. I'm running Slurm 18.08.1 and have configured limits for our users using QOS. The limits are set on the default QOS, which most users belong to.

# sacctmgr show qos
Name Priority GraceTime Preempt PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall MaxTRES MaxTRESPerNode MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU MaxTRESPA MaxJobsPA MaxSubmitPA MinTRES
---------- ---------- ---------- ---------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------
normal 0 00:00:00 cluster 1.000000 cpu=72,mem=7+ 10000
nav 0 00:00:00 cluster 1.000000
eva 0 00:00:00 cluster 1.000000 cpu=18,mem=1+
emre-high 0 00:00:00 cluster 1.000000

Nothing has changed recently, but today I noticed that the QOS limits, which had been working until now, have silently stopped being enforced. One user was able to submit enough jobs to saturate the cluster single-handedly, to the annoyance of other users. There are no errors in the slurmctld logs.

How can I go about troubleshooting this? Any suggestions are welcome.

/ Aravindh

--
Aravindh Sampathkumar
aravi...@fastmail.com