Hi Antonio,

If you don't care which nodes are used, but want to ensure that only uniform nodes are used, you could also use the topology/tree plugin: define one "switch" for every node type and then use sbatch --switches=1.
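
As a minimal sketch (the node names and switch labels here are hypothetical), slurm.conf would set TopologyPlugin=topology/tree and topology.conf would group each node type under its own leaf switch:

   # topology.conf: one leaf "switch" per hardware generation,
   # plus a root switch connecting them
   SwitchName=genA Nodes=nodeA[01-10]
   SwitchName=genB Nodes=nodeB[01-20]
   SwitchName=genC Nodes=nodeC[01-15]
   SwitchName=root Switches=genA,genB,genC

With that in place, sbatch --switches=1 asks Slurm to place the job within a single "switch", i.e. on nodes of one type only (optionally with a maximum wait time via the count@max-time form).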

Best
Marcus

On 05/11/2018 05:49 PM, Antonio Lara wrote:

Thank you all for your answers, I will research some more along these lines!

Any other opinion is welcome

Regards,

Antonio


On 11/05/18 at 16:05, Vicker, Darby (JSC-EG311) wrote:

I’ll second that – we have a cluster with 4 generations of nodes.  We assign a processor type feature to each node and require the users to ask for at least one of those features in their jobs via job_submit.lua – see the code below.  For a job that can run on any processor type, you can use this:

#SBATCH --constraint=[wes|san|has|bro]

See the constraint section of “man sbatch” for more details, but in short this constrains the job to run on any of the processor types, while all allocated nodes will be of a single type.  It really works great from a utilization standpoint – jobs will run on the first processor type that is free.

local feature_count = 0

if job_desc ~= nil and job_desc.features ~= nil then
   if string.match(job_desc.features, "wes") then feature_count = feature_count + 1 end
   if string.match(job_desc.features, "san") then feature_count = feature_count + 1 end
   if string.match(job_desc.features, "has") then feature_count = feature_count + 1 end
   if string.match(job_desc.features, "bro") then feature_count = feature_count + 1 end
end

if feature_count > 0 then
   slurm.log_info("Found %s valid cpu features", feature_count)
else
   slurm.log_user("Invalid features - aerolab policy requires specifying one or more of wes,san,has,bro.")
   slurm.log_error("Found %s cpu features from %s", feature_count, submit_uid)
   -- See slurm/slurm_errno.h and src/common/slurm_errno.c
   -- for the list of error codes and messages.
   return 2002
end
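
For context: a snippet like this sits inside the slurm_job_submit(job_desc, part_list, submit_uid) function of job_submit.lua, and the plugin is enabled with JobSubmitPlugins=lua in slurm.conf (the script also needs a slurm_job_modify function, even if it just returns slurm.SUCCESS).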

Of course, the user can leave off the square brackets and get any mix of processor types.  We have some codes that run fine across different processor types so we allow this.  Our group is small enough that we can easily educate and police the users to do the right thing.  But you could add more logic to job_submit.lua to require the brackets if you wanted to.
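
A rough sketch of that extra check, under the assumption that you want to reject unbracketed requests naming more than one feature (the pattern match is deliberately simplistic):

   -- Hypothetical addition inside slurm_job_submit(): if more than one CPU
   -- feature is requested, require the [a|b|...] form so that all allocated
   -- nodes end up being of a single type.
   if feature_count > 1 and not string.match(job_desc.features, "%[.*%]") then
      slurm.log_user("Please request CPU features as --constraint=[wes|san|has|bro] so all nodes share one type.")
      return 2002
   end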

Darby

*From: *slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Hadrian Djohari <hx...@case.edu>
*Reply-To: *Slurm User Community List <slurm-users@lists.schedmd.com>
*Date: *Friday, May 11, 2018 at 5:22 AM
*To: *Slurm User Community List <slurm-users@lists.schedmd.com>
*Cc: *"slurm-us...@schedmd.com" <slurm-us...@schedmd.com>
*Subject: *Re: [slurm-users] Distribute jobs in similar nodes in the same partition

You can use node features to define the node types in slurm.conf.

Then, when requesting the job, use -C <feature name> to restrict it to just those node types.
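
A minimal sketch with hypothetical node names and feature labels (the trailing "..." stands for the usual CPU/memory attributes of the NodeName lines):

   # slurm.conf: tag each hardware generation with a feature
   NodeName=nodeA[01-10] Feature=typeA ...
   NodeName=nodeB[01-20] Feature=typeB ...
   NodeName=nodeC[01-15] Feature=typeC ...

A user then restricts a job to one generation with, for example, sbatch -C typeA jobscript.sh.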

On Fri, May 11, 2018, 5:38 AM Antonio Lara <antonio.l...@uam.es> wrote:

    Hello everyone,

    Hopefully someone can help me with this; I cannot find in the manual whether this is even possible.

    I'm a system administrator, and the following question is from the administrator's point of view, not the user's:

    I work with a cluster that has a partition containing many nodes. These nodes belong to "different categories". That is, we bought several machines of the same type at once, and we did this several times. So, for example, we have 10 machines of type A, 20 machines of type B and 15 machines of type C. Machines of type A are more powerful than machines of type B, which are more powerful than machines of type C.

    What I am trying to achieve is that Slurm "forces" parallelized jobs to be allocated on machines of the same type, if possible. That is, some kind of priority that tries to allocate only machines of type A, or only machines of type B, or only of type C, and only distributes a job among machines of different types when there are not enough nodes of the same type available.

    Does anyone know if this is possible? The idea behind this is that slower machines do not delay the calculations on faster machines when a job is distributed among them, so that all machines work at more or less the same pace.

    I've been told that it is NOT an option to create different partitions, each containing only one type of machine.

    Please note that I'm not looking for a way to choose, as a user, which nodes to use for a job; what I need is for Slurm to make that decision and pick similar nodes when they are available.

    The closest thing I could find in the manual was consumable resources, but I don't think that is what I need; there are several examples, but none of them seem to fit this case.

    Thank you for your help!



--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de
