Hi all. We want to achieve a simple thing with Slurm: launch "normal" jobs, and be able to launch "high priority" jobs that run as soon as possible. That's it. However, we cannot achieve this reliably: our current config sometimes works and sometimes doesn't, and this is driving us crazy.
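For context, we submit jobs more or less like this (the script names are just placeholders; the resource requests match the example listing further down):

# normal jobs: 1 CPU, 1 GB of memory each
sbatch --partition=t1 --qos=normal --cpus-per-task=1 --mem=1024 normal_job.sh

# high priority job: 24 CPUs, 80 GB of memory
sbatch --partition=t1 --qos=high --cpus-per-task=24 --mem=81920 high_prio_job.sh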
When it works, this is what happens:

- we have, let's say, 10 jobs running with normal priority (--qos=normal, final Priority=1001) and a few thousand in PENDING state
- we submit a new job with high priority (--qos=high, final Priority=1001001)
- at this point, Slurm waits for running normal-priority jobs to end and free up the required resources, and then starts the high-priority job.

That's perfect! However, from time to time, seemingly at random, this does not happen. Here is an example:

# the node has around 200GB of memory and 24 CPUs
Partition=t1 State=PD Priority=1001001 Nice=0 ID=337455 CPU=24 Memory=80G Nice=0 Started=0:00 User=u1 Submitted=2020-07-07T07:16:47
Partition=t1 State=R Priority=1001 Nice=0 ID=337475 CPU=1 Memory=1024M Nice=0 Started=1:22 User=u1 Submitted=2020-07-07T10:31:46
Partition=t1 State=R Priority=1001 Nice=0 ID=334355 CPU=1 Memory=1024M Nice=0 Started=58:09 User=u1 Submitted=2020-06-23T09:57:11
Partition=t1 State=R Priority=1001 Nice=0 ID=334354 CPU=1 Memory=1024M Nice=0 Started=6:29:59 User=u1 Submitted=2020-06-23T09:57:11
Partition=t1 State=R Priority=1001 Nice=0 ID=334353 CPU=1 Memory=1024M Nice=0 Started=13:25:55 User=u1 Submitted=2020-06-23T09:57:11
[...]

You see? The high-priority job (337455) is still pending, while Slurm keeps starting jobs with lower priority (e.g. 337475 was submitted hours after it and is already running). Why is that?

Some info about our config. Slurm is version 16.05. Here is the priority config:

##### file /etc/slurm-llnl/slurm.conf
PriorityType=priority/multifactor
PriorityFavorSmall=NO
PriorityWeightQOS=1000000
PriorityWeightFairshare=1000
PriorityWeightPartition=1000
PriorityWeightJobSize=0
PriorityWeightAge=0

##### command "sacctmgr show qos"
  Name    Priority  MaxSubmitPA
  normal         0           30
  high        1000

Any idea? Thanks
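P.S. For what it's worth, this is how I understand the final Priority values come out of the weights above (assuming the QOS factor is the QOS priority divided by the highest configured QOS priority, 1000 here):

high job:   1000/1000 * PriorityWeightQOS (1000000) + partition (~1000) + fairshare (~1) = 1001001
normal job:    0/1000 * PriorityWeightQOS (1000000) + partition (~1000) + fairshare (~1) =    1001

So, as far as I can tell, the QOS weight alone should put the high-priority job well ahead of any normal job.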