[slurm-users] spreading jobs out across the cluster
I'm currently testing a new Slurm setup before converting an existing PBS/Torque grid over. Right now I've got 8 nodes in one partition, 48 cores on each. There's a second partition of older systems configured as 4-core nodes so the users can run some serial jobs.

During some testing I've noticed that jobs always seem to take the nodes in a top-down fashion. If I queue up a bunch of 3-node jobs, they take nodes 1, 2 and 3 for one job, and 4, 5 and 6 for another. Nodes 7 and 8 never get used. I'd like to have Slurm spread the jobs out across the nodes in a round-robin fashion, or even randomly. My config is really basic right now; I'm using defaults for most everything.

Which settings could get the jobs spread out across the nodes in each partition a bit more fairly?

--
Stephen Berg, IT Specialist, Ocean Sciences Division, Code 7309
Naval Research Laboratory
W: (228) 688-5738  DSN: (312) 823-5738  C: (228) 365-0162
Email: stephen.b...@nrlssc.navy.mil <- (Preferred contact)
Flank Speed: stephen.p.berg@us.navy.mil
Re: [slurm-users] spreading jobs out across the cluster
Hi Stephen,

"Stephen Berg, Code 7309" writes:

> I'm currently testing a new slurm setup before converting an existing
> pbs/torque grid over. Right now I've got 8 nodes in one partition, 48
> cores on each. There's a second partition of older systems configured
> as 4 core nodes so the users can run some serial jobs.
>
> During some testing I've noticed that jobs always seem to take the
> nodes in a top down fashion. If I queue up a bunch of 3 node jobs
> they take nodes 1, 2 and 3 for one job, 4,5 and 6 for another. Nodes 7
> and 8 never get used. I'd like to have slurm spread the jobs out
> across the nodes in a round robin fashion or even randomly. My config
> is really basic right now, I'm using defaults for most everything.
>
> Which settings could get the jobs spread out across the nodes in each
> partition a bit more fairly?

You can set LLN for "least loaded nodes" in the configuration of the partition (see 'man slurm.conf').

However, this is often not what you want. If you maximise the number of nodes in use, you won't be able to save energy by powering down nodes which are not required. What is your use-case for wanting to spread the jobs out?

Cheers,

Loris

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
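For reference, a partition definition using the LLN option Loris mentions might look something like the sketch below in slurm.conf; the partition and node names here are placeholders for illustration, not taken from Stephen's actual configuration:

    # Hypothetical slurm.conf partition entry: schedule jobs onto the
    # least-loaded nodes instead of filling nodes from the top down.
    PartitionName=batch Nodes=node[1-8] LLN=YES State=UP Default=YES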
[slurm-users] Aborting a job from inside the prolog
Hi,

We are doing some checking on the user's job inside the prolog script, and upon failure of those checks the job should be cancelled. Our first approach with `scancel $SLURM_JOB_ID; exit 1` doesn't seem to work, as the (sbatch) job still gets re-queued.

Is this possible at all (i.e. preventing jobs from running if some check fails), and what would be the correct way to do it?

Thanks,
Alex
[slurm-users] Disable --no-allocate support for a node/SlurmD
Hi,

We do some additional checking on the user and the batch script in a Prolog script. However, the `--no-allocate`/`-Z` option bypasses the allocation and hence the execution of the Prolog/Epilog.

Is there a way to configure SlurmD to deny access to jobs without allocations, or more generally to all interactive jobs?

I know that only specific users are allowed to use `-Z`, but disallowing circumvention of the Prolog on a specific node would provide some additional safety, as that node would then need to be breached first.

Thanks,
Alex
Re: [slurm-users] Disable --no-allocate support for a node/SlurmD
Hello Alex,

I'd suggest taking a look at Slurm's Lua plugins for these kinds of problems:

https://slurm.schedmd.com/cli_filter_plugins.html
https://slurm.schedmd.com/job_submit_plugins.html

As far as I understand it, cli_filter.lua is geared towards controlling the use of specific command-line options, like the --no-allocate you mentioned (and the cli_filter.lua.example available in the Slurm sources shows how one can forbid the use of `srun --pty` - a classic way to start interactive jobs - for anyone except root).

job_submit.lua allows you to view (and edit!) all job parameters that are known at submit time, including the option to refuse a configuration by returning `slurm.ERROR` instead of `slurm.SUCCESS`. The common way to filter for interactive jobs in job_submit.lua is checking whether job_desc.script is nil or an empty string (i.e. the job submission doesn't have a script attached to it). You can do a lot more within job_submit.lua - I know of multiple sites (including the cluster I'm maintaining) that use it to, for example, automatically sort jobs into the correct partition(s) according to their resource requirements.

All in all, these two interfaces are (imho) much better suited for the kind of task you're suggesting (checking job parameters, refusing specific job configurations) than prolog scripts, since technically, by the time the prolog scripts are starting, the job configuration has already been finalized and accepted by the scheduler.

Kind regards,
René Sitt

On 14.06.23 at 15:03, Alexander Grund wrote:
> Hi,
>
> we do some additional checking on a user and the batch script in a
> Prolog script. However the `--no-allocate`/`-Z` bypasses allocation
> and hence execution of the Prolog/Epilog.
>
> Is there a way to configure SlurmD to deny access to jobs without
> allocations or more generally all interactive jobs?
>
> I know that only specific users are allowed to use `-Z` but
> disallowing circumventing the Prolog on a specific node would provide
> some additional safety as now that node would need to be breached
> first.
>
> Thanks,
> Alex

--
Dipl.-Chem. René Sitt
Hessisches Kompetenzzentrum für Hochleistungsrechnen
Philipps-Universität Marburg
Hans-Meerwein-Straße
35032 Marburg

Tel. +49 6421 28 23523
si...@hrz.uni-marburg.de
www.hkhlr.de
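As an illustration of the mechanism René describes, a minimal job_submit.lua that rejects script-less (i.e. interactive) submissions could look roughly like the sketch below. The structure follows the standard job_submit/lua plugin interface; the log message text is invented for this example, and a production version would likely need more nuanced rules:

    -- job_submit.lua (sketch): refuse submissions that carry no batch script,
    -- which is the usual signature of interactive srun/salloc jobs.
    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.script == nil or job_desc.script == '' then
            slurm.log_user("Interactive jobs are not permitted on this cluster.")
            return slurm.ERROR
        end
        return slurm.SUCCESS
    end

    -- Required companion hook; this sketch accepts job modifications unchanged.
    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end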
Re: [slurm-users] Disable --no-allocate support for a node/SlurmD
> job_submit.lua allows you to view (and edit!) all job parameters that
> are known at submit time, including the option to refuse a
> configuration by returning `slurm.ERROR` instead of `slurm.SUCCESS`.
> The common way to filter for interactive jobs in job_submit.lua is
> checking whether job_desc.script is nil or an empty string (i.e. the
> job submission doesn't have a script attached to it). You can do a
> lot more within job_submit.lua - I know of multiple sites (including
> the cluster I'm maintaining) that use it to, for example,
> automatically sort jobs into the correct partition(s) according to
> their resource requirements.

Thanks for the suggestion. However, as I understand it, this requires additionally trusting the node those scripts run on, which is, I guess, the one running SlurmCtlD.

> All in all, these two interfaces are (imho) much better suited for
> the kind of task you're suggesting (checking job parameters, refusing
> specific job configurations) than prolog scripts, since technically
> by the time the prolog scripts are starting, the job configuration
> has already been finalized and accepted by the scheduler.

The reason we are using Prolog scripts is that they run on the very node the job will run on. So we make that node "secure" (or at least harden it, e.g. by disabling SSH access and restricting any other connections). Then anything running on this node has a high trust level, e.g. the SlurmD and the Prolog script. If required, the node could be rebooted with a fixed image after each job, removing any potential compromise. That isn't feasible for the SlurmCtlD, as that would affect the whole cluster and unrelated jobs. Hence the checks (for example filtering out interactive jobs, but also some additional authentication) should be done on the hardened node(s).

It would work if there weren't a way to circumvent the Prolog. So ideally I'd like to have a configuration option for the SlurmD such that it doesn't accept such jobs. As the SlurmD config is on the node, it can also be considered secure.

So while I fully agree that those plugins are better suited and likely easier to use, I fear that it is much easier to prevent them from running, and hence bypass those restrictions, than it would be to bypass something (local) at the level of the SlurmD.

Please correct me if I misunderstood anything.

Kind regards,
Alexander Grund
Re: [slurm-users] Disable --no-allocate support for a node/SlurmD
Hi,

> Thanks for the suggestion. However as I understand it this requires
> additionally trusting the node where those scripts are running on,
> which is, I guess, the one running SlurmCtlD.
>
> The reason we are using Prolog scripts is that they are running on
> the very node the job will be running on. So we make that one
> "secure" (or at least harden it by e.g. disabling SSH access and
> restricting any other connections). Then anything running on this
> node has a high trust level, e.g. the SlurmD and the Prolog script.
> If required the node could be rebooted with a fixed image after each
> job removing any potential compromise. That isn't feasible for the
> SlurmCtlD as that would affect the whole cluster and unrelated jobs.
> Hence the checks (for example filtering out interactive jobs, but
> also some additional authentication) should be done on the hardened
> node(s).
>
> It would work if there wasn't a way to circumvent the Prolog. So
> ideally I'd like to have a configuration option for the SlurmD such
> that it doesn't accept such jobs. As the SlurmD config is on the node
> it can also be considered secure.
>
> So while I fully agree that those plugins are better suited and
> likely easier to use I fear that it is much easier to prevent them
> from running and hence bypass those restrictions than having
> something (local) at the level of the SlurmD.
>
> Please correct me if I misunderstood anything.

Ah okay, so your requirements include completely insulating (some) jobs from outside access, including root? I've seen this kind of requirement in the context of, e.g., working with non-defaced medical data - generally a tough problem imo, because this level of data security seems more or less incompatible with the idea of a multi-user HPC system. I remember that this year's ZKI-AK Supercomputing spring meeting had Sebastian Krey from GWDG presenting the KISSKI ("KI-Servicezentrum für Sensible und Kritische Infrastrukturen", https://kisski.gwdg.de/) project, which works in this problem domain - are you involved in that? The setup with containerization and 'node hardening' sounds very similar to me.

Re "preventing the scripts from running": I'd say that is about as easy as otherwise manipulating any job submission that goes through slurmctld (e.g. by editing slurm.conf), so without knowing your exact use case and requirements, I can't think of a simple solution.

Kind regards,
René Sitt

--
Dipl.-Chem. René Sitt
Hessisches Kompetenzzentrum für Hochleistungsrechnen
Philipps-Universität Marburg
Hans-Meerwein-Straße
35032 Marburg

Tel. +49 6421 28 23523
si...@hrz.uni-marburg.de
www.hkhlr.de
[slurm-users] trying to configure preemption partitions and also non-preemption with OverSubcribe=FORCE
The general idea is to have priority batch partitions where preemption can occur for higher-priority jobs (suspending the lower-priority ones). There is also an interactive partition where users can run GUI tools that can't be preempted. This works fine up to the point where I would like to set OverSubscribe=FORCE:2 on the interactive partition. Instead of seeing this do what I would hope, which is seeing 2x the number of single-CPU jobs run on the interactive partition, the next job after 1x CPUs are allocated pends.

Is it possible to have preemption turned on in general and still have OverSubscribe work the way it works without preemption on a partition with PreemptMode=OFF? If so, I must be missing something in my configuration (see below). If not, why?

Below are the details of my setup:

kbr...@slm-dev.ba.rivosinc.com:~ via ✦2 ❯ sinfo
PARTITION   AVAIL TIMELIMIT  NODES STATE NODELIST
low*        up    14-00:00:0     2 idle  cs44,cs1-dev
medium      up    14-00:00:0     2 idle  cs44,cs1-dev
high        up    14-00:00:0     2 idle  cs44,cs1-dev
interactive up    14-00:00:0     1 idle  cs2-dev

kbr...@slm-dev.ba.rivosinc.com:~ via ✦2 ❯ scontrol show partition interactive
PartitionName=interactive
   AllowGroups=ALL AllowAccounts=rvs,gd1-dv AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=cs2-dev
   PriorityJobFactor=1 PriorityTier=100 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:2
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=400 MaxMemPerNode=UNLIMITED

kbr...@slm-dev.ba.rivosinc.com:~ via ✦2 ❯ scontrol show config | grep Preempt
PreemptMode             = GANG,SUSPEND
PreemptType             = preempt/partition_prio
PreemptExemptTime       = 00:00:00

kbr...@slm-dev.ba.rivosinc.com:~ via ✦2 ❯ srun -p interactive sleep 600 &
[5] 60490
kbr...@slm-dev.ba.rivosinc.com:~ via ✦3 ❯ srun -p interactive sleep 600 &
[6] 60613
kbr...@slm-dev.ba.rivosinc.com:~ via ✦4 ❯ srun -p interactive sleep 600 &
[7] 60696
srun: job 18919 queued and waiting for resources
kbr...@slm-dev.ba.rivosinc.com:~ via ✦5 ❯ sq
JOBID PARTITION NAME  USER   ST TIME NODES CPU MIN_MEMO NODELIST(REASON)
18919 interact  sleep kbroch PD 0:00     1   1 400M     (Resources)
18917 interact  sleep kbroch R  0:04     1   1 400M     cs2-dev
18918 interact  sleep kbroch R  0:04     1   1 400M     cs2-dev

Best,
/
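For context, the setup shown above would correspond to slurm.conf entries roughly like the sketch below. Only the node lists, the global preemption settings and the interactive partition's parameters are taken from the output above; the PriorityTier values for low/medium/high and the per-partition PreemptMode lines are assumptions made for illustration, not the poster's actual configuration:

    # slurm.conf (reconstruction, not the actual file)
    PreemptType=preempt/partition_prio
    PreemptMode=SUSPEND,GANG
    # Preemptable batch tiers -- PriorityTier values below are assumed
    PartitionName=low         Nodes=cs44,cs1-dev Default=YES PriorityTier=1  PreemptMode=SUSPEND MaxTime=14-00:00:00
    PartitionName=medium      Nodes=cs44,cs1-dev PriorityTier=10 PreemptMode=SUSPEND MaxTime=14-00:00:00
    PartitionName=high        Nodes=cs44,cs1-dev PriorityTier=50 PreemptMode=SUSPEND MaxTime=14-00:00:00
    # Non-preemptable interactive partition with forced two-way oversubscription
    PartitionName=interactive Nodes=cs2-dev PriorityTier=100 PreemptMode=OFF OverSubscribe=FORCE:2 DefMemPerCPU=400 MaxTime=14-00:00:00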