Hi all,

I recently performed a forklift upgrade of SLURM at my site, from version 14.10.X to the most recent available, 17.02.8. Since so many versions were being skipped, we didn't try a phased series of upgrades; instead, we used sacctmgr show commands to dump users, accounts, etc., then purged the old accounting database and reinstalled from scratch.
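For reference, the dump step was roughly along these lines (the file names here are just placeholders, not our exact procedure):

    sacctmgr show user --parsable2 > users.txt
    sacctmgr show account --parsable2 > accounts.txt
    sacctmgr show association --parsable2 > assocs.txt
    sacctmgr show qos --parsable2 > qos.txt

and the users, accounts and QOS were then recreated against the fresh database with the corresponding sacctmgr add/create commands.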
After starting things back up again, most uses of SLURM seem to be working fine, as before; however, there is one particular application feeding jobs into SLURM that isn't quite getting the scheduling it needs. The steward of this application tells me that the problem occurred somewhat on the old version, 14.10.X, but it is "worse" now. We haven't really changed anything in our slurm.conf between versions. It's not clear to me whether there is something we need to set to make 17.02.8 behave like 14.X in terms of how it schedules jobs, or whether we've simply never had the QoS defined right for this workflow and it's biting us harder in this version, or whether it's some kind of FairShare issue, etc. So far, the ONLY change we've made to slurm.conf is to add assoc_limit_continue to our SchedulerParameters directive. According to my end user, this helped somewhat, but the behavior is still worse than it was in 14.10.X.

The problem, basically, is this: we have a partition with 10 large nodes (120 threads, 1.5 TB RAM, etc.) that is fed jobs under the auspices of an application service user, and we are seeing the machines quite underutilized. Only a very small portion of the jobs in the queue (< 5%) run, while the rest sit pending with various reasons, including ReqNodeNotAvail (Resources), QOSMaxCPUPerUserLimit, or just Resources. There should be plenty of resources available; none of the ten machines is running more than 5 jobs concurrently, and some are totally idle, running no jobs at all.

Here's a description of what my user is seeing:
--------------------------------------------------------------
Define 30+ QOS like this:

    sacctmgr create qos name=$1 priority=1000 maxcpusperuser=$2
    sacctmgr update user name=USER set qos+=$1

The intent is to provide a means to force particular jobs to certain hosts where the data is local. The QOS is intended to prevent the target host from overwhelming its local file system by running too many of the same commands concurrently.

Submit sbatch jobs this way:

    sbatch -p PARTITION --mem=8G --nodelist=HOSTNAME --qos=QOS --workdir=CONSOLELOG -J JOBNAME --output=CONSOLELOG/OUTPUT BASH_COMMAND_AND_ARGS

Starting with all hosts idle, begin submitting jobs, which scatter around to the various hosts using a QOS for each host. Initially these jobs start up right away. Continue submitting jobs (usually in small blocks of 10-100 every few minutes, but the rate does not seem to matter); jobs show up in their QOS as queued. When the total number of jobs gets into the hundreds, I see some machines begin to stop starting more jobs. At this point, with hundreds queued, I often see the running jobs on a host drop to zero as the previously running jobs complete. The host may then sit idle for minutes to hours (even forever). All nine hosts behave in a similar fashion, but never consistently. This behavior is independent of the 'size' of the job (some run in 2 minutes, some in 12+ hours). It SEEMS the more jobs queued, the more likely this slowdown/stop happens. These nine machines are all nearly identical and capable of running several hundred concurrent jobs (as long as the job mix is 'right'). This behavior was true for SLURM version 14 and now the most recent (17.02.8).

This is similar to what I saw in the old version. There definitely was something way bad with Friday's setup. I GUESS your change improved this version to behave like the older SLURM. In the old version I could sometimes (not always) get SLURM to run jobs by setting the priority, e.g.

    scontrol update priority=11000 jobid=JOBID

Sometimes I'd even have to play games, changing the priority value a few times. I figured this was just something broken in the old version.
--------------------------------------------------------------

Seems not; there's something more fundamentally wrong with how we have things set up.
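For clarity, those per-host QOS are created by a small wrapper script around the two sacctmgr lines quoted above; reconstructed, it looks roughly like this (the script name and SERVICE_USER are placeholders, not our exact setup):

    #!/bin/bash
    # create_host_qos.sh -- one QOS per data host, attached to the application service user
    # usage: create_host_qos.sh <qos-name> <max-cpus-for-this-host>
    QOSNAME=$1
    MAXCPUS=$2
    # -i commits the changes without prompting
    sacctmgr -i create qos name=$QOSNAME priority=1000 maxcpusperuser=$MAXCPUS
    sacctmgr -i update user name=SERVICE_USER set qos+=$QOSNAME

The MaxCpusPerUser value is sized per host, since each QOS maps to one data host.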
Here's some config relating to the scheduler from our slurm.conf that might be germane:
---------------------------------------------------------------------------------------------------------------------
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# changed to oversubscribe resources - cjs 13/02/2016
SchedulerParameters=defer,sched_interval=120,bf_max_job_user=50,bf_max_job_test=50,bf_continue,assoc_limit_continue
PriorityType=priority/multifactor
PriorityFavorSmall=YES
PriorityMaxAge=30-0
PriorityWeightAge=100000
PriorityWeightFairshare=1000
PriorityWeightJobSize=1000
PriorityWeightQOS=1000000
AccountingStorageEnforce=limits,qos
PreemptMode = SUSPEND,GANG
PreemptType = preempt/partition_prio

And the definition of the nodes and partitions:
-------------------------------------------------------
NodeName=DEFAULT Weight=20 RealMemory=1500000 Procs=120 TmpDisk=11400000 Gres=tmp:11000 Feature="eth-10g"
NodeName=bignode,bignode[2,4-5],otherbignode
NodeName=bignode3 Procs=112 TmpDisk=9000000 Gres=tmp:9000
NodeName=bignode6 Procs=80 TmpDisk=9000000 Gres=tmp:9000
NodeName=bignode[7-10] Procs=112 TmpDisk=25000000 Gres=tmp:25000
PartitionName=bignode-working AllowGroups=bignodeusers Priority=30 PreemptMode=off Nodes=bignode,bignode[2-7,9-10]
PartitionName=bignode-calling AllowGroups=bignodeusers Priority=30 PreemptMode=off Nodes=bignode8

I'd really appreciate any help the SLURM wizards can provide! We suspect it's something to do with how we've set up QoS, or maybe we need to tweak the scheduler configuration in 17.02.8, but there's no single clear path forward. Just let me know if there's any further information I can provide to help troubleshoot or give fodder for suggestions.

Thanks,
Sean
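P.S. In case it's useful, this is roughly how I've been poking at the state when the hosts go idle (the service-user and QOS names are placeholders; these are just standard Slurm client commands, not anything from our scripts):

    # pending jobs for the service user, with their QOS and the scheduler's reason
    squeue -u SERVICE_USER -t PD -o "%.12i %.10q %.8u %.20r %.10M"

    # the limits actually stored for each per-host QOS after the upgrade
    sacctmgr show qos format=Name,Priority,MaxTRESPerUser --parsable2

    # full detail on one stuck job, plus main/backfill scheduler statistics
    scontrol show job JOBID
    sdiag

Happy to send the output of any of these for a stuck job if that would help.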
