[slurm-users] Configuration recommendations for heterogeneous cluster

Prentice Bisbal Tue, 22 Jan 2019 12:40:24 -0800

Slurm Users,

I would like your input on the best way to configure Slurm for a heterogeneous cluster I am responsible for. This e-mail will probably be a bit long to include all the necessary details of my environment, so thanks in advance to those of you who read all of it!

The cluster I support is a very heterogeneous cluster with several different network technologies and generations of processors. Although some people here refer to this cluster as numerous different clusters, in reality it is one cluster, since all the nodes have their work assigned to them from a single Slurm controller, all the nodes use the same executables installed on a shared drive, and all nodes are diskless and use the same NFSroot OS image, so they are all configured 100% alike.

The cluster has been built piecemeal over a number of years, which explains the variety of hardware/networking in use. In Slurm, each of the different "clusters" is a separate partition intended to serve different purposes:

Partition "E" - AMD Opteron 6320 processors, 64 GB RAM/node, 1 GbE, meant for serial and low task count parallel jobs that only use a few cores and stay within a single node. Limited to 16 tasks or fewer by its QOS.

Partition "D" - AMD Opteron 6136, 6274, and 6376 processors, 32 GB or 64 GB RAM per node, 10 GbE, meant for general-purpose parallel jobs spanning multiple nodes. Minimum task count of 32 to keep smaller jobs that should run on Partition E from running here.

Partition "K" - AMD Opteron 6274 and 6376 processors, 64 GB RAM per node, DDR IB network, meant for tightly-coupled parallel jobs.

Partition "G1" - AMD Opteron 6274, 6276, 6376, and Intel Xeon E5-2698 v3 & E5-2630 v3 processors, RAM ranging from 128 GB to 512 GB per node, 1 GbE network, meant for "large memory" jobs. Some nodes are in different racks attached to different switches, so it is not really optimal for multi-node jobs.

Partition "J" - AMD Opteron 6136 processors, 280 GB RAM per node, DDR IB. It was originally meant for a specific project; I now need to allow general access to it.

Partition "G2" - AMD Opteron 6136, 6274, and 6320 processors, 32 GB, 96 GB, and 128 GB RAM per node, IB network, access is restricted to specific users/projects.

Partition "M" - Intel Xeon E5-2698 v3 and E5-2697A v4 processors, 128 GB RAM per node, 1 GbE network, reserved for running one specific simulation application.

To make all this work so far, I have created a job_submit.lua script with numerous checks and conditionals that has become quite unwieldy. As a result, changes that should be simple take a considerable amount of time for me to rewrite and test the script. On top of that, almost all of the logic in that script is logic that Slurm can already perform in a more easily manageable way. I've essentially re-invented wheels that Slurm already provides.

Further, each partition has its own QOS, so my job_submit.lua assigns each job to a specific partition and QOS depending on its resource requirements. This means that a job assigned to D, which could also run on K if K is idle, will never be able to run on K. As a result, cluster nodes can go unutilized, reducing cluster utilization stats (which management looks at) and increasing job queue time (which users are obsessed with).

I would like to simplify this configuration as much as possible to reduce the labor it takes me to maintain my job_submit.lua script, and therefore make me more responsive to my users' needs, and to increase cluster utilization. Since I have numerous different networks, I was thinking that I could use the topology.conf file to keep jobs on a single network and prevent multi-node jobs from running on partition E. The partitions reserved for specific projects/departments would still need to be requested explicitly.


At first, I was going to take this approach:

1. Create a single partition with all the general access nodes

2. Create a topology.conf file to make sure jobs stay within a single network.

3. Assign weights to the different partitions so that Slurm will try to assign jobs to them in a specific order of preference.

4. Assign weights to the different nodes, so that the nodes with the fastest processors are preferred.

After getting responses to my questions about the topology.conf file, it seems this approach may not be viable, or at least may not be the best procedure.


I am now considering this:

0. Restrict access to the non-general access partitions (this is already done for the most part, hence step 0).


1. Assign each partition its own QOS in the slurm.conf file.

2. Assign a weight to the partitions so Slurm attempts to assign jobs to them in a specific order.

3. Assign weights to the nodes so the nodes are assigned in a specific order (faster processors first).


4. Set job_submit plugin to all_partitions, or partition

Step 4 in this case is the area I'm the least familiar with. One of the reasons we are using a job_submit.lua script is that users will often request partitions that are inappropriate for their job needs (like trying to run a job that spans multiple nodes on a partition with only 1 GbE, or requesting partition G because it's free when their job only uses 1 MB of RAM). I'm also not sure if I want to give up using job_submit.lua 100% by switching the job_submit plugin to "partition".
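
To make the plan concrete, here is a rough slurm.conf sketch of steps 1 through 4. All node names, node ranges, QOS names, and numeric values are placeholders made up for illustration, and it assumes that PriorityTier is the closest thing Slurm offers to a partition "weight". Other required node and partition parameters are omitted.

```conf
# Sketch only: node names, ranges, QOS names, and all numbers below are
# hypothetical placeholders, not values from the real cluster.

# Step 3: all else being equal, Slurm allocates the nodes with the
# lowest Weight first, so faster nodes get the smaller numbers.
NodeName=k[001-032] Weight=10
NodeName=d[001-032] Weight=20
NodeName=d[033-064] Weight=30
NodeName=e[001-016] Weight=40

# Steps 1 and 2: one QOS per partition (each QOS must already exist in
# the accounting database), and PriorityTier standing in for a
# partition "weight" (assumption). Pending jobs in a partition with a
# higher PriorityTier are considered by the scheduler first.
PartitionName=K Nodes=k[001-032] QOS=k_qos PriorityTier=30
PartitionName=D Nodes=d[001-064] QOS=d_qos PriorityTier=20
# MaxNodes=1 keeps multi-node jobs off the 1 GbE partition.
PartitionName=E Nodes=e[001-016] QOS=e_qos PriorityTier=10 MaxNodes=1

# Step 4: jobs submitted without a partition are offered to all
# partitions rather than being locked to one at submit time.
JobSubmitPlugins=all_partitions
```

As I understand the documentation, all_partitions only changes the default for jobs that do not name a partition, so the access restrictions from step 0 would still have to be enforced on the partitions themselves.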

My ultimate goal is to have users specify what resources they need without specifying a QOS or partition, and let Slurm handle that automatically based on the weights I assign to the nodes and partitions. I also don't want to lock a job to a specific partition at submit time, so that Slurm can allocate it to idle nodes in a different partition if that partition has idle nodes when the job is finally eligible to run.

What is the best way to achieve my goals? All suggestions will be considered.


For those of you who made it this far, thanks!

Prentice



--
Prentice