[slurm-users] spreading jobs out across the cluster

2023-06-14 Thread Stephen Berg, Code 7309
I'm currently testing a new Slurm setup before converting an existing 
PBS/Torque grid over.  Right now I've got 8 nodes in one partition, 48 
cores on each.  There's a second partition of older systems configured 
as 4-core nodes so the users can run some serial jobs.


During some testing I've noticed that jobs always seem to take the nodes 
in a top-down fashion.  If I queue up a bunch of 3-node jobs, one job 
takes nodes 1, 2 and 3, another takes nodes 4, 5 and 6, and nodes 7 and 
8 never get used.  I'd like to have Slurm spread the jobs out across the 
nodes in a round-robin fashion or even randomly.  My config is really 
basic right now; I'm using defaults for almost everything.


Which settings would get the jobs spread out across the nodes in each 
partition a bit more evenly?


--
Stephen Berg, IT Specialist, Ocean Sciences Division, Code 7309
Naval Research Laboratory
W:   (228) 688-5738
DSN: (312) 823-5738
C:   (228) 365-0162
Email: stephen.b...@nrlssc.navy.mil  <- (Preferred contact)
Flank Speed: stephen.p.berg@us.navy.mil





Re: [slurm-users] spreading jobs out across the cluster

2023-06-14 Thread Loris Bennett
Hi Stephen,

"Stephen Berg, Code 7309"  writes:

> I'm currently testing a new Slurm setup before converting an existing
> PBS/Torque grid over.  Right now I've got 8 nodes in one partition, 48
> cores on each.  There's a second partition of older systems configured
> as 4-core nodes so the users can run some serial jobs.
>
> During some testing I've noticed that jobs always seem to take the
> nodes in a top-down fashion.  If I queue up a bunch of 3-node jobs, one
> job takes nodes 1, 2 and 3, another takes nodes 4, 5 and 6, and nodes 7
> and 8 never get used.  I'd like to have Slurm spread the jobs out
> across the nodes in a round-robin fashion or even randomly.  My config
> is really basic right now; I'm using defaults for almost everything.
>
> Which settings would get the jobs spread out across the nodes in each
> partition a bit more evenly?

You can set

  LLN

("least loaded nodes") in the configuration of the partition (see 'man
slurm.conf').
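
For example, in slurm.conf that could look roughly like this (a sketch
only: partition and node names are placeholders for your own, and
everything apart from LLN=YES is just illustrative):

  # Hypothetical partition definitions; adjust names, node lists and limits to your site.
  PartitionName=main   Nodes=node[1-8]  Default=YES State=UP MaxTime=INFINITE LLN=YES
  PartitionName=serial Nodes=old[1-12]  State=UP MaxTime=INFINITE LLN=YES

There is also a cluster-wide variant, CR_LLN, which can be added to
SelectTypeParameters to apply least-loaded-node selection everywhere
(again, see 'man slurm.conf').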

However, this is often not what you want.  If you maximise the number
of nodes in use, you won't be able to save energy by powering down nodes
which are not required.  What is your use case for wanting to spread the
jobs out?

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin



[slurm-users] Aborting a job from inside the prolog

2023-06-14 Thread Alexander Grund

Hi,

We are doing some checking on the user's job inside the prolog script, 
and upon failure of those checks the job should be cancelled.


Our first approach with `scancel $SLURM_JOB_ID; exit 1` doesn't seem to 
work, as the (sbatch) job still gets requeued.
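
For concreteness, the prolog currently does roughly the following 
(check_job here stands in for our site-specific checks):

#!/bin/bash
# Prolog sketch of the approach described above; /usr/local/sbin/check_job is a placeholder.
if ! /usr/local/sbin/check_job "$SLURM_JOB_ID"; then
    scancel "$SLURM_JOB_ID"
    exit 1   # non-zero Prolog exit: Slurm drains the node and requeues the job, which matches what we see
fi
exit 0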


Is this possible at all (i.e. preventing jobs from running if some check 
fails), and what would be the correct way to do it?


Thanks,
Alex





[slurm-users] Disable --no-allocate support for a node/SlurmD

2023-06-14 Thread Alexander Grund

Hi,

we do some additional checking on the user and the batch script in a 
Prolog script.
However, the `--no-allocate`/`-Z` option bypasses allocation and hence 
execution of the Prolog/Epilog.


Is there a way to configure SlurmD to deny access to jobs without 
allocations or, more generally, to all interactive jobs?


I know that only specific users are allowed to use `-Z`, but disallowing 
circumvention of the Prolog on a specific node would provide some 
additional safety, as that node would then need to be breached first.


Thanks,
Alex





Re: [slurm-users] Disable --no-allocate support for a node/SlurmD

2023-06-14 Thread René Sitt

Hello Alex,

I'd suggest taking a look at Slurm's Lua plugins for this kind of problem:

https://slurm.schedmd.com/cli_filter_plugins.html
https://slurm.schedmd.com/job_submit_plugins.html

As far as I understand it, cli_filter.lua is geared towards controlling 
the use of specific command-line options, like the --no-allocate you 
mentioned (and the cli_filter.lua.example available in the Slurm sources 
shows how one can forbid the use of `srun --pty` - a classic way to 
start interactive jobs - for anyone except root).


job_submit.lua allows you to view (and edit!) all job parameters that 
are known at submit time, including the option to refuse a configuration 
by returning `slurm.ERROR` instead of `slurm.SUCCESS`. The common way to 
filter for interactive jobs in job_submit.lua is to check whether 
job_desc.script is nil or an empty string (i.e. the job submission 
doesn't have a script attached to it). You can do a lot more within 
job_submit.lua - I know of multiple sites (including the cluster I'm 
maintaining) that use it to, for example, automatically sort jobs into 
the correct partition(s) according to their resource requirements.
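
A minimal job_submit.lua along those lines could look like this (just a 
sketch of the nil/empty-script check; the log message and any extra 
policy are up to you):

-- Sketch: reject submissions that have no batch script attached (interactive jobs).
function slurm_job_submit(job_desc, part_list, submit_uid)
   if job_desc.script == nil or job_desc.script == '' then
      slurm.log_user("Interactive jobs are not allowed on this cluster")
      return slurm.ERROR
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end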


All in all, these two interfaces are (imho) much better suited for the 
kind of task you're describing (checking job parameters, refusing 
specific job configurations) than prolog scripts, since technically, by 
the time the prolog scripts start, the job configuration has already 
been finalized and accepted by the scheduler.


Kind regards,
René Sitt

On 14.06.23 at 15:03, Alexander Grund wrote:

> Hi,
>
> we do some additional checking on the user and the batch script in a
> Prolog script.
> However, the `--no-allocate`/`-Z` option bypasses allocation and hence
> execution of the Prolog/Epilog.
>
> Is there a way to configure SlurmD to deny access to jobs without
> allocations or, more generally, to all interactive jobs?
>
> I know that only specific users are allowed to use `-Z`, but disallowing
> circumvention of the Prolog on a specific node would provide some
> additional safety, as that node would then need to be breached first.
>
> Thanks,
> Alex


--
Dipl.-Chem. René Sitt
Hessisches Kompetenzzentrum für Hochleistungsrechnen
Philipps-Universität Marburg
Hans-Meerwein-Straße
35032 Marburg

Tel. +49 6421 28 23523
si...@hrz.uni-marburg.de
www.hkhlr.de





Re: [slurm-users] Disable --no-allocate support for a node/SlurmD

2023-06-14 Thread Alexander Grund

> job_submit.lua allows you to view (and edit!) all job parameters that
> are known at submit time, including the option to refuse a configuration
> by returning `slurm.ERROR` instead of `slurm.SUCCESS`. The common way to
> filter for interactive jobs in job_submit.lua is to check whether
> job_desc.script is nil or an empty string (i.e. the job submission
> doesn't have a script attached to it). You can do a lot more within
> job_submit.lua - I know of multiple sites (including the cluster I'm
> maintaining) that use it to, for example, automatically sort jobs into
> the correct partition(s) according to their resource requirements.

Thanks for the suggestion.

However, as I understand it, this requires additionally trusting the 
node those scripts run on, which is, I guess, the one running SlurmCtlD.


> All in all, these two interfaces are (imho) much better suited for the
> kind of task you're describing (checking job parameters, refusing
> specific job configurations) than prolog scripts, since technically, by
> the time the prolog scripts start, the job configuration has already
> been finalized and accepted by the scheduler.

The reason we are using Prolog scripts is that they run on the very node 
the job will be running on.
So we make that node "secure" (or at least harden it, e.g. by disabling 
SSH access and restricting any other connections).
Then anything running on this node has a high trust level, e.g. the 
SlurmD and the Prolog script.
If required, the node could be rebooted with a fixed image after each 
job, removing any potential compromise.
That isn't feasible for the SlurmCtlD, as that would affect the whole 
cluster and unrelated jobs.


Hence the checks (for example filtering out interactive jobs, but also 
some additional authentication) should be done on the hardened node(s).


It would work if there weren't a way to circumvent the Prolog, so 
ideally I'd like to have a configuration option for the SlurmD such that 
it doesn't accept such jobs.

As the SlurmD config is on the node, it can also be considered secure.

So while I fully agree that those plugins are better suited and likely 
easier to use, I fear that it is much easier to prevent them from 
running, and hence bypass those restrictions, than it would be to bypass 
something (local) at the level of the SlurmD.

Please correct me if I misunderstood anything.

Kind Regards,
Alexander Grund




Re: [slurm-users] Disable --no-allocate support for a node/SlurmD

2023-06-14 Thread René Sitt

Hi,


> Thanks for the suggestion.
>
> However, as I understand it, this requires additionally trusting the
> node those scripts run on, which is, I guess, the one running SlurmCtlD.
>
> The reason we are using Prolog scripts is that they run on the very
> node the job will be running on.
> So we make that node "secure" (or at least harden it, e.g. by disabling
> SSH access and restricting any other connections).
> Then anything running on this node has a high trust level, e.g. the
> SlurmD and the Prolog script.
> If required, the node could be rebooted with a fixed image after each
> job, removing any potential compromise.
> That isn't feasible for the SlurmCtlD, as that would affect the whole
> cluster and unrelated jobs.
>
> Hence the checks (for example filtering out interactive jobs, but also
> some additional authentication) should be done on the hardened node(s).
>
> It would work if there weren't a way to circumvent the Prolog, so
> ideally I'd like to have a configuration option for the SlurmD such
> that it doesn't accept such jobs.
>
> As the SlurmD config is on the node, it can also be considered secure.
>
> So while I fully agree that those plugins are better suited and likely
> easier to use, I fear that it is much easier to prevent them from
> running, and hence bypass those restrictions, than it would be to
> bypass something (local) at the level of the SlurmD.
>
> Please correct me if I misunderstood anything.


Ah okay, so your requirements include completely insulating (some) jobs 
from outside access, including from root? I've seen this kind of 
requirement for e.g. working with non-defaced medical data - generally a 
tough problem imo, because this level of data security seems more or 
less incompatible with the idea of a multi-user HPC system.


I remember that this year's ZKI-AK Supercomputing spring meeting had 
Sebastian Krey from GWDG presenting the KISSKI ("KI-Servicezentrum für 
Sensible und Kritische Infrastrukturen", https://kisski.gwdg.de/ ) 
project, which works in this problem domain; are you involved in that? 
The setup with containerization and 'node hardening' sounds very similar 
to me.


Re "preventing the scripts from running": I'd say it's about as easy as 
to otherwise manipulate any job submission that goes through slurmctld 
(e.g. by editing slurm.conf), so without knowing your exact use case and 
requirements, I can't think of a simple solution.


Kind regards,
René Sitt

--
Dipl.-Chem. René Sitt
Hessisches Kompetenzzentrum für Hochleistungsrechnen
Philipps-Universität Marburg
Hans-Meerwein-Straße
35032 Marburg

Tel. +49 6421 28 23523
si...@hrz.uni-marburg.de
www.hkhlr.de





[slurm-users] trying to configure preemption partitions and also non-preemption with OverSubcribe=FORCE

2023-06-14 Thread Kevin Broch
The general idea is to have priority batch partitions where preemption
can occur for higher-priority jobs (suspending the lower-priority ones).
There's also an interactive partition, where users can run GUI tools,
that can't be preempted.

This works fine up to the point where I would like to set
OverSubscribe=FORCE:2 on the interactive partition.
Instead of seeing this do what I would hope, which is 2x the number of
single-CPU jobs running on the interactive partition, the next job pends
as soon as 1x the CPUs are allocated.

Is it possible to have preemption turned on in general and still have
OverSubscribe work the way it does without preemption, on a partition
with PreemptMode=OFF?
If so, I must be missing something in my configuration (see below).
If not, why not?
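
In slurm.conf terms, the relevant settings (reconstructed from the
scontrol output below; anything not visible in that output is omitted)
are:

PreemptType=preempt/partition_prio
PreemptMode=GANG,SUSPEND
PartitionName=interactive Nodes=cs2-dev PriorityTier=100 PreemptMode=OFF OverSubscribe=FORCE:2 MaxTime=14-00:00:00 DefMemPerCPU=400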

Below are the details of my setup:

kbr...@slm-dev.ba.rivosinc.com:~ via 
✦2 ❯ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
low*   up 14-00:00:0  2   idle cs44,cs1-dev
medium up 14-00:00:0  2   idle cs44,cs1-dev
high   up 14-00:00:0  2   idle cs44,cs1-dev
interactiveup 14-00:00:0  1   idle cs2-dev

kbr...@slm-dev.ba.rivosinc.com:~ via 
✦2 ❯ scontrol show partition interactive
PartitionName=interactive
   AllowGroups=ALL AllowAccounts=rvs,gd1-dv AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=cs2-dev
   PriorityJobFactor=1 PriorityTier=100 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:2
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=400 MaxMemPerNode=UNLIMITED


kbr...@slm-dev.ba.rivosinc.com:~ via 
✦2 ❯ scontrol show config | grep Preempt
PreemptMode = GANG,SUSPEND
PreemptType = preempt/partition_prio
PreemptExemptTime   = 00:00:00

kbr...@slm-dev.ba.rivosinc.com:~ via 
✦2 ❯ srun -p interactive sleep 600 &
[5] 60490

kbr...@slm-dev.ba.rivosinc.com:~ via 
✦3 ❯ srun -p interactive sleep 600 &
[6] 60613

kbr...@slm-dev.ba.rivosinc.com:~ via 
✦4 ❯ srun -p interactive sleep 600 &
[7] 60696
srun: job 18919 queued and waiting for resources

kbr...@slm-dev.ba.rivosinc.com:~ via 
✦5 ❯ sq
 JOBID PARTITIO  NAME   USER ST TIME NODES CPU MIN_MEMO NODELIST(REASON)
 18919 interact sleep kbroch PD 0:00     1   1     400M (Resources)
 18917 interact sleep kbroch  R 0:04     1   1     400M cs2-dev
 18918 interact sleep kbroch  R 0:04     1   1     400M cs2-dev

Best, /