Hey Slurm gurus. We have been trying to enable Slurm QOS on a Cray system here, off and on, for quite a while, but we can never get it working. Every time we try to enable QOS we disrupt the cluster and its users and have to fall back, and I'm not sure what we are doing wrong. We run a pretty open system here since we are a research group, but there are times when we need to let a user run a job that exceeds a partition limit. In lieu of using QOS, the only other way we have figured out to do this is to create a new partition and push out the modified slurm.conf. It's a hassle.
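For reference, what we're hoping QOS will eventually let us do is something along these lines (the QOS name, user name, and limits below are just made-up examples, not what we actually have configured):

    # create a QOS whose flags allow it to override partition limits
    sacctmgr add qos longrun Flags=PartitionTimeLimit,PartitionMaxNodes MaxWall=14-00:00:00
    # grant it to the one user who needs the exception
    sacctmgr modify user alice set qos+=longrun
    # the user then submits against it
    sbatch --qos=longrun --time=10-00:00:00 job.sh

so that we wouldn't have to stand up a one-off partition every time someone needs an exception.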
I'm not sure exactly what information is needed to troubleshoot this, but my understanding is that to enable QOS we need to set this line in slurm.conf:

AccountingStorageEnforce=limits,qos

Every time we attempt this, no one can submit a job; slurm says the jobs are waiting on resources, I believe. We have accounting enabled and everyone is a member of the default QOS, "normal".

Configuration data as of 2019-03-05T09:36:19
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = hickory-1
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,bb/cray,gres/craynetwork,gres/gpu
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/rapl
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq = 30 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = 1
AuthInfo = (null)
AuthType = auth/munge
BackupAddr = hickory-2
BackupController = hickory-2
BatchStartTimeout = 10 sec
BOOT_TIME = 2019-03-04T16:11:55
BurstBufferType = burst_buffer/cray
CacheGroups = 0
CheckpointType = checkpoint/none
ChosLoc = (null)
ClusterName = hickory
CompleteWait = 0 sec
ControlAddr = hickory-1
ControlMachine = hickory-1
CoreSpecPlugin = cray
CpuFreqDef = Performance
CpuFreqGovernors = Performance,OnDemand
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerNode = UNLIMITED
DisableRootJobs = No
EioTimeout = 60
EnforcePartLimits = NO
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FastSchedule = 0
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu,craynetwork
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/cncu
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = cray
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 1
KillWait = 30 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Layouts =
Licenses = (null)
LicensesUsed = (null)
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 1001
MaxJobCount = 10000
MaxJobId = 67043328
MaxMemPerCPU = 128450
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MemLimitEnforce = Yes
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = ports=20000-32767
MsgAggregationParams = (null)
NEXT_JOB_ID = 244342
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /opt/slurm/17.02.6/lib64/slurm
PlugStackConfig = /etc/opt/slurm/plugstack.conf
PowerParameters = (null)
PowerPlugin =
PreemptMode = OFF
PreemptType = preempt/none
PriorityParameters = (null)
PriorityDecayHalfLife = 7-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 0
PriorityWeightFairShare = 0
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 0
PriorityWeightTRES = (null)
PrivateData = none
ProctrackType = proctrack/cray
Prolog = (null)
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = (null)
PropagateResourceLimitsExcept = AS
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 2
RoutePlugin = route/default
SallocDefaultCommand = (null)
SbcastParameters = (null)
SchedulerParameters = (null)
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cray
SelectTypeParameters = CR_CORE_MEMORY,OTHER_CONS_RES,NHC_ABSOLUTELY_NO
SlurmUser = root(0)
SlurmctldDebug = info
SlurmctldLogFile = /var/spool/slurm/slurmctld.log
SlurmctldPort = 6817
SlurmctldTimeout = 120 sec
SlurmdDebug = info
SlurmdLogFile = /var/spool/slurmd/%h.log
SlurmdPidFile = /var/spool/slurmd/slurmd.pid
SlurmdPlugstack = (null)
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/spool/slurm/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /etc/opt/slurm/slurm.conf
SLURM_VERSION = 17.02.6
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /apps/cluster/hickory/slurm/
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/cray
TaskEpilog = (null)
TaskPlugin = task/cray,task/affinity,task/cgroup
TaskPluginParam = (null type)
TaskProlog = (null)
TCPTimeout = 2 sec
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec

Slurmctld(primary/backup) at hickory-1/hickory-2 are UP/UP
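Before the next attempt I'm planning to double-check that every user really does have an association and the "normal" QOS, with something like the following (exact format fields may not be right for 17.02, this is just the sort of check I mean):

    # every user should appear with an association and the normal QOS
    sacctmgr show assoc format=cluster,account,user,partition,qos,defaultqos
    # and the QOS itself should exist without surprising limits
    sacctmgr show qos format=name,priority,flags,maxwall,maxtrespu

Is there anything in those tables, or in the config above, that would explain jobs going pending as soon as AccountingStorageEnforce=limits,qos goes in?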