Hi Prentice,

Have you considered Slurm features and constraints at all? You define 
features (arbitrary strings in your slurm.conf) describing what your 
hardware provides ("amd", "ib", "FAST", "whatever"). A user then lists 
constraints using the usual AND/OR notation ( --constraint="amd&ib" ).
You may override or autofill constraint defaults yourself in your 
job_submit.lua.
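
A minimal sketch of what I mean (node names, feature strings, and the 
exact keywords are from memory, so check the slurm.conf and 
job_submit.lua docs before trusting them):

    # slurm.conf: tag nodes with arbitrary feature strings
    # (alongside your usual CPU/memory specs)
    NodeName=node[001-064] Feature=amd,ib,FAST

    # user side: quote the constraint so the shell doesn't eat the '&'
    sbatch --constraint="amd&ib" my_job.sh

    -- job_submit.lua: autofill a default constraint if the user gave none
    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.features == nil or job_desc.features == "" then
            job_desc.features = "amd&ib"
        end
        return slurm.SUCCESS
    end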

Another level: you may also create your own Slurm arguments to sbatch or 
srun using SPANK plugins. These could be used to simplify a constraint 
list in whatever way you might see fit (e.g. sbatch --fast equates to 
--constraint="amd&ib&FAST" ).
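
Roughly, the user-facing effect would be (the --fast flag here is 
hypothetical; you would implement it in your own SPANK plugin):

    # what the user types
    sbatch --fast my_job.sh

    # what it effectively becomes
    sbatch --constraint="amd&ib&FAST" my_job.sh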

So, as a possibility: keep all nodes in one partition, supply the 
features in slurm.conf, have job_submit.lua give a default set of 
constraints (and/or force the user to provide a minimum set), and 
create another partition that includes all the nodes as well but is 
preemptible/VIP/whatever (PriorityTier works nicely here too).
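
For the preemption piece you mention below, something along these lines 
is what I have in mind (partition names, node list, and settings are 
only an illustration, assuming PreemptType=preempt/partition_prio):

    # slurm.conf (global)
    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE

    # two partitions over the same nodes; jobs in "scavenger" get
    # requeued when "general" (higher PriorityTier) needs the nodes
    PartitionName=general   Nodes=node[001-064] Default=YES PriorityTier=10 PreemptMode=OFF
    PartitionName=scavenger Nodes=node[001-064] PriorityTier=1 PreemptMode=REQUEUE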

There are several ways to approach this, and I imagine you really want 
the users to be able to "just submit" with a minimum of effort and 
information on their part, while keeping changes and updates manageable 
on your end. I find the logic of the feature/constraint system quite 
elegant for meeting the complex needs of heterogeneous systems.

Best,

Cyrus

On 1/22/19 2:49 PM, Prentice Bisbal wrote:
> I left out a *very* critical detail: One of the reasons I'm looking 
> at revamping my Slurm configuration is that my users have requested 
> the capability to submit long-running, low-priority interruptible jobs 
> that can be killed and requeued when shorter-running, higher-priority 
> jobs need to use the resources.
>
> Prentice Bisbal
> Lead Software Engineer
> Princeton Plasma Physics Laboratory
> http://www.pppl.gov
>
> On 1/22/19 3:38 PM, Prentice Bisbal wrote:
>> Slurm Users,
>>
>> I would like your input on the best way to configure Slurm for a 
>> heterogeneous cluster I am responsible for. This e-mail will probably 
>> be a bit long to include all the necessary details of my environment 
>> so thanks in advance to those of you who read all of it!
>>
>> The cluster I support is a very heterogeneous cluster with several 
>> different network technologies and generations of processors. 
>> Although some people here refer to this cluster as numerous 
>> different clusters, in reality it is one cluster, since all the nodes 
>> have their work assigned to them from a single Slurm Controller, all 
>> the nodes use the same executables installed on a shared drive, and 
>> all nodes are diskless and use the same NFSroot OS image, so they are 
>> all configured 100% alike.
>>
>> The cluster has been built piece-meal over a number of years, which 
>> explains the variety of hardware/networking in use. In Slurm, each of 
>> the different "clusters" is a separate partition intended to serve 
>> different purposes:
>>
>> Partition "E" - AMD Opteron 6320 processors, 64 GB RAM/node, 1 GbE, 
>> meant for serial and low-task-count parallel jobs that only use a 
>> few cores and stay within a single node. Limited to 16 tasks or fewer 
>> in the QOS.
>>
>> Partition "D" - AMD Opteron 6136, 6274, and 6376 processors, 32 GB or 
>> 64 GB RAM per node, 10 GbE, meant for general-purpose parallel jobs 
>> spanning multiple nodes. Minimum task count of 32 to prevent smaller 
>> jobs that should run on Partition E from running here.
>>
>> Partition "K"  - AMD Opteron 6274 and 6376 processors, 64 GB RAM per 
>> node, DDR IB network, meant for tightly-coupled parallel jobs
>>
>> Partition "G1" - AMD Opteron 6274, 6276, 6376, and Intel Xeon E5-2698 
>> v3 &  E5-2630 v3 processors, RAM ranging from 128 GB - 512 GB per 
>> node, 1 GbE Network, meant for "large memory" jobs - some nodes are 
>> in different racks attached to different switches, so not really 
>> optimal for multi-node jobs.
>>
>> Partition "J" -  AMD Opteron 6136 Processors, 280 GB RAM per node, 
>> DDR IB, was orginally meant for a specific project, I now need to 
>> allow general access to it.
>>
>> Partition "G2" - AMD Opteron 6136, 6274, and 6320 processors, 32 GB, 
>> 96 GB, and 128 GB RAM per node, IB network, access is restricted to 
>> specific users/projects.
>>
>> Partition "M" - Intel Xeon E5-2698 v3 and E5-2697A v4 processors, 128 
>> GB RAM per node, 1 GbE network, reserved for running 1 specific 
>> simulation application.
>>
>> To make all this work so far, I have created a job_submit.lua script 
>> with numerous checks and conditionals that has become quite unwieldy. 
>> As a result, changes that should be simple take a considerable amount 
>> of time for me to rewrite and test the script. On top of that, almost 
>> all of the logic in that script is logic that Slurm can already 
>> perform in a more easily manageable way. I've essentially re-invented 
>> wheels that Slurm already provides.
>>
>> Further, each partition has its own QOS, so my job_submit.lua 
>> assigns each job to a specific partition and QOS depending on its 
>> resource requirements. This means that a job assigned to D, which 
>> could also run on K if K is idle, will never be able to run on K. As 
>> a result, cluster nodes can go unutilized, reducing cluster 
>> utilization statistics (which management looks at) and increasing job 
>> queue time (which users are obsessed with).
>>
>> I would like to simplify this configuration as much as possible to 
>> reduce the labor it takes me to maintain my job_submit.lua script, 
>> and therefore make me more responsive to my users' needs, and 
>> increase cluster utilization. Since I have numerous different 
>> networks, I was thinking that I could use the topology.conf file to 
>> keep jobs on a single network and prevent multi-node jobs from 
>> running on partition E. The partitions reserved for specific 
>> projects/departments would still need to be requested explicitly.
>>
>> At first, I was going to take this approach:
>>
>> 1. Create a single partition with all the general access nodes
>>
>> 2. Create a topology.conf file to make sure jobs stay within a single 
>> network.
>>
>> 3. Assign weights to the different partitions so that Slurm will try 
>> to assign jobs to them in a specific order of preference.
>>
>> 4. Assign weights to the different nodes, so that the nodes with the 
>> fastest processors are preferred.
>>
>> After getting responses to my questions about the topology.conf file, 
>> it seems like this approach may not be viable, or at least not the 
>> best procedure.
>>
>> I am now considering this:
>>
>> 0. Restrict access to the non-general access partitions (this is 
>> already done for the most part, hence step 0).
>>
>> 1. Assign each partition its own QOS in the slurm.conf file.
>>
>> 2. Assign a weight to the partitions so Slurm attempts to assign jobs 
>> to them in a specific order.
>>
>> 3. Assign weights to the nodes so the nodes are assigned in a 
>> specific order (faster processors first)
>>
>> 4. Set the job_submit plugin to all_partitions or partition.
>>
>>
>> Step 4 in this case is the area I'm the least familiar with. One of 
>> the reasons we are using a job_submit.lua script is that users will 
>> often request partitions that are inappropriate for their job needs 
>> (like trying to run a job that spans multiple nodes on a partition 
>> with only 1 GbE, or requesting partition G because it's free when 
>> their job only uses 1 MB of RAM). I'm also not sure if I want to give 
>> up using job_submit.lua 100% by switching job_submit_plugin to 
>> "partition".
>>
>> My ultimate goal is to have users specify what resources they need 
>> without specifying a QOS or Partition, and let Slurm handle that 
>> automatically based on the weights I assign to the nodes and 
>> partitions. I also don't want to lock a job to a specific partition 
>> at submit time, so that Slurm can allocate it to idle nodes in a 
>> different partition if that partition has idle nodes when the job is 
>> finally eligible to run.
>>
>> What is the best way to achieve my goals? All suggestions will be 
>> considered.
>>
>> For those of you who made it this far, thanks!
>>
>> Prentice
>>
>>
>>
>
