Hey Slurm gurus. We have been trying to enable Slurm QOS on a Cray system here, off and on, for quite a while, but we can never get it working. Every time we try to enable QOS we disrupt the cluster and its users and have to fall back, and I'm not sure what we are doing wrong. We run a pretty open system here since we are a research group, but there are times when we need to let a user run a job that exceeds a partition limit. In lieu of using QOS, the only other way we have figured out to do this is to create a new partition and push out the modified slurm.conf. It's a hassle.
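For reference, what we're hoping QOS will eventually let us do is something along these lines (the QOS name, user name, and limits below are just made-up examples, not what we actually have configured):

    # create a QOS whose flags allow it to override partition limits
    sacctmgr add qos longrun Flags=PartitionTimeLimit,PartitionMaxNodes MaxWall=14-00:00:00
    # grant it to the one user who needs the exception
    sacctmgr modify user alice set qos+=longrun
    # the user then submits against it
    sbatch --qos=longrun --time=10-00:00:00 job.sh

so that we wouldn't have to stand up a one-off partition every time someone needs an exception.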
I'm not sure exactly what information is needed to troubleshoot this, but my understanding is that to enable QOS we need to set this line in slurm.conf:

AccountingStorageEnforce=limits,qos

Every time we attempt this, no one can submit a job; slurm says the jobs are waiting on resources, I believe. We have accounting enabled and everyone is a member of the default QOS, "normal".

Configuration data as of 2019-03-05T09:36:19
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = hickory-1
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,bb/cray,gres/craynetwork,gres/gpu
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/rapl
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq = 30 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = 1
AuthInfo = (null)
AuthType = auth/munge
BackupAddr = hickory-2
BackupController = hickory-2
BatchStartTimeout = 10 sec
BOOT_TIME = 2019-03-04T16:11:55
BurstBufferType = burst_buffer/cray
CacheGroups = 0
CheckpointType = checkpoint/none
ChosLoc = (null)
ClusterName = hickory
CompleteWait = 0 sec
ControlAddr = hickory-1
ControlMachine = hickory-1
CoreSpecPlugin = cray
CpuFreqDef = Performance
CpuFreqGovernors = Performance,OnDemand
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerNode = UNLIMITED
DisableRootJobs = No
EioTimeout = 60
EnforcePartLimits = NO
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FastSchedule = 0
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu,craynetwork
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/cncu
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = cray
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 1
KillWait = 30 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Layouts =
Licenses = (null)
LicensesUsed = (null)
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 1001
MaxJobCount = 10000
MaxJobId = 67043328
MaxMemPerCPU = 128450
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MemLimitEnforce = Yes
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = ports=20000-32767
MsgAggregationParams = (null)
NEXT_JOB_ID = 244342
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /opt/slurm/17.02.6/lib64/slurm
PlugStackConfig = /etc/opt/slurm/plugstack.conf
PowerParameters = (null)
PowerPlugin =
PreemptMode = OFF
PreemptType = preempt/none
PriorityParameters = (null)
PriorityDecayHalfLife = 7-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 0
PriorityWeightFairShare = 0
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 0
PriorityWeightTRES = (null)
PrivateData = none
ProctrackType = proctrack/cray
Prolog = (null)
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = (null)
PropagateResourceLimitsExcept = AS
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 2
RoutePlugin = route/default
SallocDefaultCommand = (null)
SbcastParameters = (null)
SchedulerParameters = (null)
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cray
SelectTypeParameters = CR_CORE_MEMORY,OTHER_CONS_RES,NHC_ABSOLUTELY_NO
SlurmUser = root(0)
SlurmctldDebug = info
SlurmctldLogFile = /var/spool/slurm/slurmctld.log
SlurmctldPort = 6817
SlurmctldTimeout = 120 sec
SlurmdDebug = info
SlurmdLogFile = /var/spool/slurmd/%h.log
SlurmdPidFile = /var/spool/slurmd/slurmd.pid
SlurmdPlugstack = (null)
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/spool/slurm/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /etc/opt/slurm/slurm.conf
SLURM_VERSION = 17.02.6
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /apps/cluster/hickory/slurm/
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/cray
TaskEpilog = (null)
TaskPlugin = task/cray,task/affinity,task/cgroup
TaskPluginParam = (null type)
TaskProlog = (null)
TCPTimeout = 2 sec
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec

Slurmctld(primary/backup) at hickory-1/hickory-2 are UP/UP
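Before the next attempt I'm planning to double-check that every user really does have an association and the "normal" QOS, with something like the following (exact format fields may not be right for 17.02, this is just the sort of check I mean):

    # every user should appear with an association and the normal QOS
    sacctmgr show assoc format=cluster,account,user,partition,qos,defaultqos
    # and the QOS itself should exist without surprising limits
    sacctmgr show qos format=name,priority,flags,maxwall,maxtrespu

Is there anything in those tables, or in the config above, that would explain jobs going pending as soon as AccountingStorageEnforce=limits,qos goes in?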