Re: [slurm-users] Mixing GPU Types on Same Node

2023-03-29 Thread Thomas M. Payerle
You can probably have a job submit lua script that looks at the --gpus flag (and maybe the --gres=gpu:* flag as well) and force a GPU type. A bit complicated, and not sure if it will catch srun submissions. I don't think this is flexible enough to ensure they get the least powerful GPU among all

[slurm-users] Mixing GPU Types on Same Node

2023-03-29 Thread collin.m.mccarthy
Hello, Apologies if this is in the docs but I couldn't find it anywhere. I've been using Slurm to run a small 7-node cluster in a research lab for a couple of years now (I'm a PhD student). A couple of our nodes have heterogeneous GPU models. One in particular has quite a few: 2x NVIDIA A100s,

[slurm-users] LSF Wrappers for Slurm

2023-03-29 Thread Amir Ben Avi
Hi, Does anyone know if there are any LSF wrappers (bsub, bjobs, bkill, etc.) that work with Slurm? What I have found so far is a table that converts LSF commands to Slurm commands. Any info will be appreciated. Thanks, Amir

[slurm-users] Using JSON/YAML to describe jobs for submission to SLURM

2023-03-29 Thread Nicholas Yue
Hi, I am looking at parsing some data and submitting lots of jobs to SLURM, and was wondering if there is a way to describe all the jobs and their dependencies in a JSON file and submit that file instead of making individual calls to SLURM? Cheers -- Nicholas Yue https://www.linkedin.c
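One possible shape for this, as a minimal sketch rather than an existing Slurm feature: a small driver reads the JSON, orders the jobs by dependency, and calls sbatch once per job with `--dependency=afterok:...`. The JSON layout (`jobs`, `name`, `script`, `after`) and the helper names `sbatch` and `submit_all` are made up for illustration; only the sbatch options `--parsable`, `--job-name` and `--dependency` are real.

```python
import json
import subprocess
from graphlib import TopologicalSorter  # Python 3.9+


def sbatch(argv):
    """Run sbatch and return the job ID printed by --parsable."""
    out = subprocess.run(argv, check=True, capture_output=True, text=True).stdout
    # --parsable prints "jobid" or "jobid;cluster"
    return out.strip().split(";")[0]


def submit_all(spec, run=sbatch):
    """Submit every job in spec in dependency order; return {name: job_id}."""
    jobs = {job["name"]: job for job in spec["jobs"]}
    # Map each job to the set of jobs that must finish first.
    graph = {name: set(job.get("after", [])) for name, job in jobs.items()}
    job_ids = {}
    # static_order() yields predecessors first and raises CycleError on cycles.
    for name in TopologicalSorter(graph).static_order():
        argv = ["sbatch", "--parsable", f"--job-name={name}"]
        deps = jobs[name].get("after", [])
        if deps:
            argv.append("--dependency=afterok:" + ":".join(job_ids[d] for d in deps))
        argv.append(jobs[name]["script"])
        job_ids[name] = run(argv)
    return job_ids


# Hypothetical job description; the schema is an assumption, not a Slurm format.
EXAMPLE = json.loads("""
{"jobs": [
  {"name": "prep",  "script": "prep.sh"},
  {"name": "train", "script": "train.sh", "after": ["prep"]},
  {"name": "eval",  "script": "eval.sh",  "after": ["prep", "train"]}
]}
""")
```

This still makes one sbatch call per job, but the caller only maintains the JSON file. Passing a different `run` callable allows a dry run without a cluster.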

Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-29 Thread Ole Holm Nielsen
Hi Thomas, I think the Slurm power_save is not problematic for us with bare-metal on-premise nodes, in contrast to the problems you're having. We use power_save with on-premise nodes where we control the power down/up by means of IPMI commands as provided in the scripts which you will find i

Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-29 Thread Dr. Thomas Orgis
On Wed, 29 Mar 2023 14:42:33 +0200, Ben Polman wrote:
> I'd be interested in your kludge; we face a similar situation where the slurmctld node does not have access to the IPMI network and cannot ssh to machines that have access.
> We are thinking of creating a REST interface to a contro

Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-29 Thread Ben Polman
I'd be interested in your kludge; we face a similar situation where the slurmctld node does not have access to the IPMI network and cannot ssh to machines that have access. We are thinking of creating a REST interface to a control server which would be running the IPMI commands. Ben On 29-

Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-29 Thread Dr. Thomas Orgis
On Mon, 27 Mar 2023 13:17:01 +0200, Ole Holm Nielsen wrote:
> FYI: Slurm power_save works very well for us without the issues that you describe below. We run Slurm 22.05.8, what's your version?
I'm sure that there are setups where it works nicely ;-) For us, it didn't, and I was faced with h

Re: [slurm-users] Keep CPU Jobs Off GPU Nodes

2023-03-29 Thread Markus Kötter
Hello, On 29.03.23 10:08, René Sitt wrote:
> While the cited procedure works great in general, it gets more complicated for heterogeneous setups, i.e. if you have several GPU types defined in gres.conf, since the 'tres_per_' fields can then take the form of either 'gres:gpu:N' or 'gres:gpu:<type>:N'

Re: [slurm-users] Keep CPU Jobs Off GPU Nodes

2023-03-29 Thread René Sitt
Hello, maybe some additional notes: While the cited procedure works great in general, it gets more complicated for heterogeneous setups, i.e. if you have several GPU types defined in gres.conf, since the 'tres_per_' fields can then take the form of either 'gres:gpu:N' or 'gres:gpu:<type>:N' - depen
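To illustrate why the two forms complicate matters, here is a rough Python model of the parsing a job_submit script would need. It is not the Lua plugin itself. It assumes the typed form carries a GPU type name between 'gpu' and the count (e.g. 'gres:gpu:a100:2') and that a missing count means one GPU; the function name `parse_gpu_request` is hypothetical.

```python
def parse_gpu_request(tres):
    """Return a list of (gpu_type_or_None, count) from a tres_per_* style string."""
    requests = []
    for item in (tres or "").split(","):
        parts = item.strip().split(":")
        if parts[:2] != ["gres", "gpu"]:
            continue  # not a GPU request (other gres, or empty string)
        rest = parts[2:]
        count = 1  # assumption: no explicit count means a single GPU
        if rest and rest[-1].isdigit():
            count = int(rest.pop())
        gpu_type = rest[0] if rest else None
        requests.append((gpu_type, count))
    return requests
```

A type name made up only of digits would be misread as a count here, so a production check would need to validate types against those defined in gres.conf.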

Re: [slurm-users] Keep CPU Jobs Off GPU Nodes

2023-03-29 Thread Wagner, Marcus
Hi Frank, use Features on the nodes: every CPU node gets e.g. "cpu", every GPU node e.g. "gpu". If a job asks for no GPUs, set an additional constraint "cpu" for the job. Best, Marcus
On 29.03.2023 at 01:24, Frank Pari wrote:
> Well, I wanted to avoid using Lua. But, it looks like that's goi
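The constraint rule described above can be modelled in a few lines. This is a Python sketch of the decision only, not a job_submit plugin; the function name `add_cpu_constraint` and its arguments are hypothetical, and it assumes the job's existing constraint is a plain '&'-joined list of features.

```python
def add_cpu_constraint(features, gpu_count, cpu_feature="cpu"):
    """Return the constraint string the job should carry after the policy is applied."""
    if gpu_count > 0:
        return features  # GPU jobs are left untouched
    if not features:
        return cpu_feature  # no user constraint: just require a CPU node
    if cpu_feature in features.split("&"):
        return features  # already constrained to CPU nodes
    # "&" is the AND operator in Slurm constraint expressions
    return f"{features}&{cpu_feature}"
```

Constraints that already use '|' are not handled in this sketch; how the appended '&cpu' would combine with an OR expression would need checking against the sbatch documentation.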

Re: [slurm-users] Keep CPU Jobs Off GPU Nodes

2023-03-29 Thread Ward Poelmans
Hi, We have dedicated partitions for GPUs (their names end with _gpu) and simply forbid a job that is not requesting GPU resources from using these partitions:

local function job_total_gpus(job_desc)
  -- return total number of GPUs allocated to the job
  -- there are many ways to request a GPU