Thank you all for your answers, I will research some more along these lines!
Any other opinion is welcome.
Regards,
Antonio
On 11/05/18 at 16:05, Vicker, Darby (JSC-EG311) wrote:
I’ll second that – we have a cluster with 4 generations of nodes. We
assign a processor type feature to each node and require the users to
ask for at least one of those features in their jobs via
job_submit.lua – see the code below. For a job that can run on any
processor type, you can use this:
#SBATCH --constraint=[wes|san|has|bro]
See the constraint section of “man sbatch” for more details, but in short
this lets the job run on any processor type while keeping all of its
nodes on a single type. It works great from a utilization standpoint:
the job runs on the first processor type that becomes free.
-- Excerpt from job_submit.lua: the submit hook rejects jobs that do not
-- request at least one of the known processor-type features.
function slurm_job_submit(job_desc, part_list, submit_uid)
   local feature_count = 0
   if job_desc ~= nil and job_desc.features ~= nil then
      if string.match(job_desc.features, "wes") then feature_count = feature_count + 1 end
      if string.match(job_desc.features, "san") then feature_count = feature_count + 1 end
      if string.match(job_desc.features, "has") then feature_count = feature_count + 1 end
      if string.match(job_desc.features, "bro") then feature_count = feature_count + 1 end
   end
   if feature_count > 0 then
      slurm.log_info("Found %s valid cpu features", feature_count)
   else
      slurm.log_user("Invalid features - aerolab policy requires specifying one or more of wes,san,has,bro.")
      slurm.log_error("Found %s cpu features from %s", feature_count, submit_uid)
      -- See slurm/slurm_errno.h and src/common/slurm_errno.c
      -- for the list of error codes and messages.
      return 2002
   end
   return slurm.SUCCESS
end
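One thing the snippet doesn't show (so this part is an assumption about the rest of the setup, not from the original mail) is that the lua job submit plugin has to be enabled and each node has to carry its processor-type feature in slurm.conf. A rough sketch, with invented node names, ranges and CPU counts:

    # slurm.conf (sketch: node names, ranges and CPU counts are made up)
    JobSubmitPlugins=lua
    NodeName=wes[001-010] CPUs=12 Feature=wes
    NodeName=san[001-020] CPUs=16 Feature=san
    NodeName=has[001-015] CPUs=24 Feature=has
    NodeName=bro[001-015] CPUs=28 Feature=bro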
Of course, the user can leave off the square brackets and get any mix
of processor types. We have some codes that run fine across different
processor types so we allow this. Our group is small enough that we
can easily educate and police the users to do the right thing. But
you could add more logic to job_submit.lua to require the brackets if
you wanted to.
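If you did go that way, one possible (untested) check inside the same submit function would be to insist on the [a|b|c] form whenever more than one feature is named; the pattern and message below are only illustrative:

    -- Hypothetical extra check: multiple features must use the [a|b|c] syntax
    if feature_count > 1 and not string.match(job_desc.features, "^%[.*%]$") then
       slurm.log_user("Please wrap multiple features in brackets, e.g. [wes|san|has|bro].")
       return 2002
    end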
Darby
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Hadrian Djohari <hx...@case.edu>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Friday, May 11, 2018 at 5:22 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Cc: "slurm-us...@schedmd.com" <slurm-us...@schedmd.com>
Subject: Re: [slurm-users] Distribute jobs in similar nodes in the same partition
You can use node features when defining the node types in slurm.conf.
Then, when requesting resources for the job, use -C <feature name> to
use just those node types.
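For example (feature and file names are made up here), if one generation of nodes is defined with Feature=typeA in slurm.conf, a job can be pinned to those nodes with either of:

    #SBATCH --constraint=typeA
    # or directly on the command line:
    sbatch -C typeA job.sh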
On Fri, May 11, 2018, 5:38 AM Antonio Lara <antonio.l...@uam.es> wrote:
Hello everyone,
Hopefully someone can help me with this; I cannot find in the manual
whether this is even possible:
I'm a system administrator, and the following question is from the
administrator point of view, not the user's point of view:
I work with a cluster which has a partition containing many nodes. These
nodes belong to "different categories": that is, we bought several
machines of the same type at once, and we did this several times. So,
for example, we have 10 machines of type A, 20 machines of type B and
15 machines of type C. Machines of type A are more powerful than
machines of type B, which are more powerful than machines of type C.
What I am trying to achieve is that Slurm "forces" parallelized jobs to
be allocated on machines of the same type, if possible. That is, some
kind of priority that tries to allocate only machines of type A, or
only machines of type B, or only of type C, and only distributes a job
among machines of different types when there are not enough nodes of
the same type available.
Does anyone know if this is possible? The idea behind this is that
slower machines do not delay the calculations on faster machines when a
job is distributed among them, and all machines work at more or less
the same pace.
I've been told that it is NOT an option to create different partitions,
each containing only one type of machine.
Please note that I'm not looking for a way to choose, as a user, which
nodes to use for a job; what I need is for Slurm to do that and decide
which nodes to use, picking similar nodes if available.
The closest that I could find in the manual was using consumable
resources, but I think this is not what I need; there are several
examples, but they don't seem to fit this case.
Thank you for your help!