Hi Prentice,

Have you considered Slurm features and constraints at all? You define features (arbitrary strings in your slurm.conf) describing what your hardware provides ("amd", "ib", "FAST", whatever). A user then lists constraints using the usual AND/OR notation (e.g. --constraint="amd&ib"). You may also override or autofill constraint defaults yourself in your job_submit.lua.
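For illustration only (the node names, feature strings, and hardware numbers below are made up, not taken from your cluster), the feature side lives on the NodeName lines in slurm.conf and the constraint side on the submit line:

    # slurm.conf -- hypothetical node definitions carrying feature strings
    NodeName=eth[001-016] CPUs=64 RealMemory=64000 Feature=amd,eth1g
    NodeName=ib[001-032]  CPUs=32 RealMemory=64000 Feature=amd,ib,FAST

    # Submit side: "&" means AND, "|" means OR; quote the string so the
    # shell doesn't interpret the "&".
    sbatch --constraint="amd&ib" job.sh
    sbatch --constraint="ib|eth1g" job.sh

Constraints only restrict which nodes a job may land on; they don't touch priority or QOS, so they combine nicely with whatever partition/QOS scheme you settle on.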
At another level, you can also create your own sbatch or srun arguments using SPANK plugins. These could be used to simplify a constraint list in whatever way you see fit (e.g. sbatch --fast equates to --constraint="amd&ib&FAST"). So, as one possibility: keep all nodes in one partition, supply the features in slurm.conf, have job_submit.lua give a default set of constraints (and/or force the user to provide a minimum set -- a rough sketch is at the very end of this message), and create another partition that also includes all the nodes but is preemptable/VIP/whatever (PriorityTier works nicely here too). There are several ways to approach this, and I imagine you really want the users to be able to "just submit" with a minimum of effort and information on their part, while keeping your own life manageable when changes or updates come along. I find the logic of the feature/constraint system quite elegant for meeting the complex needs of heterogeneous systems.

Best,
Cyrus

On 1/22/19 2:49 PM, Prentice Bisbal wrote:
> I left out a *very* critical detail: One of the reasons I'm looking at revamping my Slurm configuration is that my users have requested the capability to submit long-running, low-priority, interruptible jobs that can be killed and requeued when shorter-running, higher-priority jobs need to use the resources.
>
> Prentice Bisbal
> Lead Software Engineer
> Princeton Plasma Physics Laboratory
> http://www.pppl.gov
>
> On 1/22/19 3:38 PM, Prentice Bisbal wrote:
>> Slurm Users,
>>
>> I would like your input on the best way to configure Slurm for a heterogeneous cluster I am responsible for. This e-mail will probably be a bit long to include all the necessary details of my environment, so thanks in advance to those of you who read all of it!
>>
>> The cluster I support is a very heterogeneous cluster with several different network technologies and generations of processors. Although some people here refer to this cluster as numerous different clusters, in reality it is one cluster, since all the nodes have their work assigned to them from a single Slurm controller, all the nodes use the same executables installed on a shared drive, and all nodes are diskless and use the same NFSroot OS image, so they are all configured 100% alike.
>>
>> The cluster has been built piecemeal over a number of years, which explains the variety of hardware/networking in use. In Slurm, each of the different "clusters" is a separate partition intended to serve a different purpose:
>>
>> Partition "E" - AMD Opteron 6320 processors, 64 GB RAM/node, 1 GbE, meant for serial and low-task-count parallel jobs that only use a few cores and stay within a single node. Limited to 16 tasks or fewer via QOS.
>>
>> Partition "D" - AMD Opteron 6136, 6274, and 6376 processors, 32 GB or 64 GB RAM per node, 10 GbE, meant for general-purpose parallel jobs spanning multiple nodes. Minimum task count of 32 to prevent smaller jobs that should run on Partition E from running here.
>>
>> Partition "K" - AMD Opteron 6274 and 6376 processors, 64 GB RAM per node, DDR IB network, meant for tightly-coupled parallel jobs.
>>
>> Partition "G1" - AMD Opteron 6274, 6276, 6376, and Intel Xeon E5-2698 v3 & E5-2630 v3 processors, RAM ranging from 128 GB to 512 GB per node, 1 GbE network, meant for "large memory" jobs - some nodes are in different racks attached to different switches, so not really optimal for multi-node jobs.
>> >> Partition "J" - AMD Opteron 6136 Processors, 280 GB RAM per node, >> DDR IB, was orginally meant for a specific project, I now need to >> allow general access to it. >> >> Partition "G2" - AMD Opteron 6136, 6274, and 6320 processors, 32 GB, >> 96 GB, and 128 GB RAM per node, IB network , access is restricted to >> specific users/projects. >> >> Partition "M" - Intel Xeon E5-2698 v3 and E5-2697A v4 processors, 128 >> GB RAM per node, 1 GbE network, reserved for running 1 specific >> simulation application. >> >> To make all this work so far, I have created a job_submit.lua script >> with numerous checks and conditionals that has become quite unwieldy. >> As a result, changes that should be simple take a considerable amount >> of time for me to rewrite and test the script. On top of that, almost >> all of the logic in that script is logic that Slurm can already >> perform in a more easily manageable way. I've essentially re-invented >> wheels that Slurm already provides. >> >> Further, each partition has it's own QOS, so my job_submit.lua >> assigns each job to a specific partition and QOS depending on it's >> resource requirements. This means that a job may be assigned to D, >> but could also run on K if K is idle , will never be able to run on >> K. This means cluster nodes could go unutilized, reducing cluster >> utilization states (which management looks at), and increasing job >> queue time (which users are obsessed with). >> >> I would like to simplify this configuration as much as possible to >> reduce the labor it takes me to maintain my job_submit.lua script, >> and therefore make me more responsive to meeting my users needs, and >> increase cluster utilization. Since I have numerous different >> networks, I was thinking the I could use the topology,conf file to >> keep jobs on a single network, and prevent multi-node jobs run on >> partition E. The partitions reserved for specific >> projects/departments would still need to be requested explicitly. >> >> At first, I was going to take this approach: >> >> 1. Create a single partition with all the general access nodes >> >> 2. Create a topology.conf file to make sure jobs stay within a single >> network. >> >> 3. Assign weights to the different partitions to that Slurm will try >> to assign jobs to them in a specific order of preference >> >> 4. Assign weights to the different nodes, so that the nodes with the >> fastest processors are preferred. >> >> After getting responses to my questions about the topology.conf file, >> this seems like this approach may not be viable, or at least not be >> best procedure. >> >> I'm am now considering this: >> >> 0. Restrict access to the non-general access partitions (this is >> already done for the most part, hence step 0). >> >> 1. Assign each Partition it's own QOS in the slurm.conf file. >> >> 2. Assign a weight to the partitions so Slurm attempts to assign jobs >> to them in a specific order. >> >> 3. Assign weights to the nodes so the nodes are assigned in a >> specific order (faster processors first) >> >> 4. Set job_submit plugin to all_partitions, or partition >> >> >> Step 4 in this case is the area I'm the least familiar with. One of >> the reasons we are using a job_submit.lua script is because users >> will often request partitions that are inappropriate for their job >> needs (like trying to run a job that spans multiple nodes on a >> partition with only 1 GbE, or request partition G because it's free, >> but their job only uses 1 MB of RAM). 
>> I'm also not sure if I want to give up using job_submit.lua 100% by switching job_submit_plugin to "partition".
>>
>> My ultimate goal is to have users specify what resources they need without specifying a QOS or partition, and let Slurm handle that automatically based on the weights I assign to the nodes and partitions. I also don't want to lock a job to a specific partition at submit time, so that Slurm can allocate it to idle nodes in a different partition if that partition has idle nodes when the job is finally eligible to run.
>>
>> What is the best way to achieve my goals? All suggestions will be considered.
>>
>> For those of you who made it this far, thanks!
>>
>> Prentice
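Here is the job_submit.lua sketch I mentioned above. It is only a minimal illustration of autofilling a default constraint; the feature names are placeholders and you would want your own policy logic in place of mine:

    -- job_submit.lua (sketch): fill in a default --constraint so users
    -- can "just submit".  The feature names ("amd", "ib") are
    -- placeholders; use whatever strings you define on your NodeName
    -- lines in slurm.conf.

    function slurm_job_submit(job_desc, part_list, submit_uid)
       -- job_desc.features holds whatever the user passed with
       -- --constraint; nil or "" means they passed nothing.
       if job_desc.features == nil or job_desc.features == "" then
          job_desc.features = "amd&ib"
          slurm.log_user("No constraint given; defaulting to " ..
                         job_desc.features)
       end
       slurm.log_info("job_submit: uid=" .. submit_uid ..
                      " constraint=" .. job_desc.features)
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end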