Hi Slurm-Users,

Hope this post finds all of you healthy and safe amidst the ongoing COVID-19 craziness. We've run into a strange error state that occurs when we enable preemption, and we need help diagnosing what is wrong. I'm not sure whether we are missing a default value or some other necessary configuration, but after enabling Slurm preemption on a cluster with multiple queues, Slurm stops reporting the 40+ CPUs on each node and instead reports only a single CPU per node (after some seemingly random amount of time). This is problematic on multiple levels and means users can no longer submit jobs that request more than one CPU.
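To show exactly what we're seeing, checks along these lines should make the mismatch visible (the node name below is just a placeholder):

> sinfo -N -o "%N %c %C"       # %c/%C show the per-node CPU count and CPU allocation state
> scontrol show node node001   # compare CPUAlloc/CPUTot here against the NodeName definition in slurm.conf
> slurmd -C                    # run on the compute node itself to see what hardware slurmd actually detects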
For some quick background on our setup: we have a 100+ node Linux cluster that uses Lustre for storage, is managed using Bright View, and uses Slurm for its scheduler. The slurm.conf file lives on a shared volume, mounted across all the nodes, on one of the Lustre file systems. We have defined a number of queues for Slurm to use and have three distinct tiers of workloads.

Before setting out we looked around but were unable to find a succinct how-to on the web describing how to configure the kind of 3-tier design we wanted, so I'll outline the steps we took below. We've tried a number of variations on the examples from https://slurm.schedmd.com/preempt.html, but none exactly matches the model we want, so it may be that we are still missing key configuration options.

The desired high-level design is for all compute and GPU nodes to sit in a lowest-priority "windfall" queue (PriorityTier value of 100), with a medium-priority pair of default queues above it (PriorityTier value of 200), called "defq" and "gpuq" for ease of use, and finally 20 or so specific high-priority queues for particular research groups above that (PriorityTier value of 300), each limited to just a few nodes, which should take final precedence. As for how preemption is handled on each tier, we don't plan to SUSPEND jobs; instead we want to CANCEL a windfall job or REQUEUE a defq/gpuq job whenever a higher-priority job from one of the researcher-specific queues requests a resource that is already in use by a lower-priority job. The final layout looks something like this (I'll put a rough sketch of the actual partition stanzas in a P.S. at the end):

PriorityTier  PreemptMode  QueueType  NodeType
100           CANCEL       windfall   all
200           REQUEUE      defq       cpu
200           REQUEUE      gpuq       gpu
300           REQUEUE      lab1       cpu
300           REQUEUE      lab2       cpu
300           ...          ...        ...      (etc.)

Once this was laid out, the next step was to make sure every queue we created had a PreemptMode of "CANCEL" or "REQUEUE" rather than "OFF" before enabling the 'preempt/partition_prio' plugin, or we'd get an error. Since the initial cluster design didn't use preemption, we added the PriorityType line first:

> PriorityType=priority/multifactor

Then we added the following two lines to slurm.conf, which seemed to enable preemption:

> PreemptType=preempt/partition_prio
> PreemptMode=REQUEUE

As far as I understand, those two lines should enable the plugin (and set the global default preemption mode for good measure). To test the changes we created a smaller queue with only 3 nodes so that we could start some interactive jobs and watch them be canceled or requeued as we requested higher-priority workloads.

Our issue occurs once we enable the preempt type. At first everything seems to work fine, but after some random amount of time all the nodes stop reporting 40+ CPUs and report only a single CPU. This is visible to the admin via `sinfo --Node --long` and to the users by the fact that only single-CPU jobs can be requested. It makes no sense; it's as if the machines suddenly have only one CPU. All the more frustrating, it also doesn't stop misbehaving right away when we revert to the previous configuration.

Big question: is this an issue anyone has seen before? Any clue what we are doing wrong, or how to further diagnose the problem when it occurs? At the moment my thought for next steps is to turn up Slurm debugging and deliberately let the error happen again, but testing on a production cluster always scares me a little.
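Concretely, what I have in mind for "turning up debugging" is roughly the following (assuming a reasonably recent Slurm; I believe these can be changed live without restarting the daemons):

> scontrol setdebug debug2                 # temporarily raise the slurmctld log verbosity
> scontrol setdebugflags +Priority         # add priority-calculation detail to the slurmctld log
> scontrol show config | grep -i logfile   # confirm where SlurmctldLogFile / SlurmdLogFile point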
Any thoughts about which log to check and what kind of events to watch for would be greatly appreciated. We are open to any thoughts or suggestions!

I'm also a bit unclear about how the priority calculation is made. I looked at the values being generated and they didn't seem to map to the changes in the queues' priority values. I tried limiting the priority calculation to ONLY use the partition priority with the additional config options below, but still didn't get the nice clean calculation I hoped for.

> PriorityWeightFairshare=0
> PriorityWeightAge=0
> PriorityWeightTRES=0
> PriorityWeightPartition=100000
> PriorityWeightJobSize=0
> PriorityWeightQOS=0

Thanks in advance,
Josh
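P.S. For completeness, the partition stanzas follow roughly this pattern. The node lists and "lab" names below are simplified placeholders rather than our real config, but the PriorityTier/PreemptMode values match the table above:

> PartitionName=windfall Nodes=node[001-100],gpu[01-20] PriorityTier=100 PreemptMode=CANCEL  Default=NO  State=UP
> PartitionName=defq     Nodes=node[001-100]            PriorityTier=200 PreemptMode=REQUEUE Default=YES State=UP
> PartitionName=gpuq     Nodes=gpu[01-20]               PriorityTier=200 PreemptMode=REQUEUE Default=NO  State=UP
> PartitionName=lab1     Nodes=node[001-004]            PriorityTier=300 PreemptMode=REQUEUE Default=NO  State=UP
> PartitionName=lab2     Nodes=node[005-008]            PriorityTier=300 PreemptMode=REQUEUE Default=NO  State=UP

(For the priority question above: `sprio -l` shows the per-factor breakdown for pending jobs, in case anyone wants to see those numbers from our side.)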