Hi all,

I recently performed a forklift upgrade of SLURM at my site from version
14.10.X to the most recent available, 17.02.8. Since so many versions were
being skipped, we didn't attempt a phased series of upgrades; instead, we
used sacctmgr show commands to dump users, accounts, etc., then purged the
old accounting database and reinstalled from scratch.

After starting things back up again, most uses of SLURM seem to be working
as before. However, there is one particular application feeding jobs into
SLURM that isn't quite getting the scheduling it needs. The steward of this
application tells me the problem occurred to some extent on the old
version, 14.10.X, but it is "worse" now.

We haven't really changed anything in our slurm.conf between versions.
It's not clear to me whether there is something we need to set to make
17.02.8 behave like 14.X in terms of how it schedules jobs, whether we've
simply never had the QoS defined right for this workflow and it's biting us
harder in this version, or whether it's some kind of FairShare issue, etc.

So far, the ONLY change we've made to slurm.conf is to add
assoc_limit_continue to our SchedulerParameters directive. According to my
end user, this helped somewhat, but the behavior is still worse than it was
in 14.10.X.

The problem is basically this: we have a partition with 10 large nodes
(120 threads, 1.5T RAM, etc.) that is being fed jobs under the auspices of
an application service user, and we are seeing the machines quite
underutilized. Only a very small portion of the jobs in the queue (< 5%)
run, while the rest sit pending for various reasons including
ReqNodeNotAvail (Resources), QOSMaxCPUPerUserLimit, or just Resources.
There should be plenty of resources available; none of the ten machines is
running more than 5 jobs concurrently, and some are totally idle, running
no jobs at all.
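
For reference, a quick way to see which QOS and pending reason each queued
job has is something along the lines of (PARTITION being a placeholder for
the partition name):

squeue -p PARTITION -t PENDING -o "%.10i %.10q %.8u %.10T %.16r %N"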

Here's a description of what my user is seeing:
--------------------------------------------------------------


Define 30+ QOS like this:
sacctmgr create qos name=$1 priority=1000 maxcpusperuser=$2
sacctmgr modify user name=USER set qos+=$1
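
To double-check that the limits actually landed where intended, the QOS
definitions and the user's QOS list can be listed with something like
(MaxTRESPU is where MaxCpusPerUser should show up, as cpu=N):

sacctmgr show qos format=Name,Priority,MaxTRESPU
sacctmgr show assoc user=USER format=User,Account,QOS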

The intent is to provide a means to force particular jobs to certain hosts
where the data is local. The QOS is intended to keep the target host from
overwhelming its local file system by running too many of the same
commands concurrently.

Submit sbatch jobs this way:
sbatch -p PARTITION --mem=8G --nodelist=HOSTNAME --qos=QOS
--workdir=CONSOLELOG -J JOBNAME --output=CONSOLELOG/OUTPUT
BASH_COMMAND_AND_ARGS
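
Once a job is queued, the pending reason and the QOS it was submitted under
can be checked with something like (JOBID being the job in question):

scontrol show job JOBID | egrep 'JobState|Reason|Priority|QOS'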

Starting with all hosts idle, begin submitting jobs which scatter around to
the various hosts using a QOS for each host. Initially these jobs start up
right away.

Continue submitting jobs (usually in small blocks of 10-100 every few
minutes, but the rate does not seem to matter). Jobs show up in their QOS
as queued. When the total number of jobs gets into the hundreds, I see some
machines stop starting new jobs.

At this point, with hundreds queued, I often see the running jobs on a
host drop to zero as the previously running jobs complete. The host may
then sit idle for minutes to hours (even forever). All nine hosts behave in
a similar fashion, but never consistently. This behavior is independent of
the 'size' of the job (some run in 2 minutes, some in 12+ hours).
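
When a host sits idle like that, it is worth confirming that SLURM also
sees it as idle, with nothing allocated, e.g.:

scontrol show node HOSTNAME | egrep 'State|CPUAlloc|AllocMem'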

It SEEMS the more jobs queued, the more likely this slowdown/stop happens.

These nine machines are all nearly identical and capable of running
several hundred concurrent jobs (as long as the job mix is 'right'). This
behavior occurred under SLURM version 14 and still does under the most
recent (17.02.8).

This is similar to what I saw in the old version. There definitely was
something way bad with Friday's setup. I GUESS your change improved this
version to behave like the older SLURM.

In the old version I could sometimes (not always) get SLURM to run jobs by
setting the priority, e.g. scontrol update priority=11000 jobid=JOBID

Sometimes I'd even have to play games changing the priority value a few
times. I figured this was just something broken in the old version. It
seems not; there's something more fundamentally wrong with how we have
things set up.
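
For reference, the individual priority factors (age, fair-share, job size,
QOS) for a pending job can be broken out with something like:

sprio -l -j JOBID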


Here's some config relating to the scheduler from our slurm.conf that might
be germane:
---------------------------------------------------------------------------------------------------------------------

SchedulerType=sched/backfill
SchedulerPort=7321

SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory # changed to oversubscribe resources - cjs 13/02/2016

SchedulerParameters=defer,sched_interval=120,bf_max_job_user=50,bf_max_job_test=50,bf_continue,assoc_limit_continue
PriorityType=priority/multifactor
PriorityFavorSmall=YES
PriorityMaxAge=30-0
PriorityWeightAge=100000
PriorityWeightFairshare=1000
PriorityWeightJobSize=1000
PriorityWeightQOS=1000000
AccountingStorageEnforce=limits,qos
PreemptMode = SUSPEND,GANG
PreemptType = preempt/partition_prio
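
In case it helps with diagnosis, I can also send scheduler statistics; as I
understand it, sdiag summarizes the main and backfill scheduling cycles
(including how deep into the queue each backfill pass actually got):

sdiag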

And definition of the nodes and partitions:
-------------------------------------------------------

NodeName=DEFAULT Weight=20 RealMemory=1500000 Procs=120 TmpDisk=11400000 Gres=tmp:11000 Feature="eth-10g"
NodeName=bignode,bignode[2,4-5],otherbignode
NodeName=bignode3 Procs=112 TmpDisk=9000000 Gres=tmp:9000
NodeName=bignode6 Procs=80 TmpDisk=9000000 Gres=tmp:9000
NodeName=bignode[7-10] Procs=112 TmpDisk=25000000 Gres=tmp:25000

PartitionName=bignode-working AllowGroups=bignodeusers Priority=30 PreemptMode=off Nodes=bignode,bignode[2-7,9-10]

PartitionName=bignode-calling AllowGroups=bignodeusers Priority=30 PreemptMode=off Nodes=bignode8



I'd really appreciate any help the SLURM wizards can provide! We suspect
it's something to do with how we've set up QoS, or maybe we need to tweak
the scheduler configuration in 17.02.8, but there's no single clear path
forward. Just let me know if there's any further information I can provide
to help troubleshoot or give fodder for suggestions.


Thanks,


Sean
