Hi Hafedh,
I'm no expert on the GPU side of Slurm, but looking at your current
configuration, it seems to me it is working as intended at the moment. You
have defined 4 GPUs and are starting multiple jobs that each consume all
4 GPUs, so the jobs have to wait for the resources to be free again.
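If the goal is simply to run several independent jobs side by side, each
on its own GPU, requesting a single GPU per job should already work with
the configuration you posted. A rough, untested sketch of the change in
the batch script (dropping --gpus-per-node=4 and --gres=gpu:4):

#SBATCH --gres=gpu:1    # one GPU per job instead of all four

With that, up to four such jobs could run on c-a100-cn01 at the same
time, one per GPU.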
If you instead want several jobs to share the GPUs, I think what you need
to look into is the MPS plugin, which seems to do what you are trying to
achieve:
https://slurm.schedmd.com/gres.html#MPS_Management
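For what it's worth, the examples on that page set up MPS roughly along
these lines (only a sketch, not tested here; the share counts are
placeholders you would tune):

# gres.conf on the compute node: keep the existing GPU definition and
# add MPS shares, e.g. 100 shares per GPU, 400 for the node
NodeName=c-a100-cn01 Name=mps Count=400

# slurm.conf: advertise both GRES types (rest of the NodeName line unchanged)
GresTypes=gpu,mps
NodeName=c-a100-cn01 Gres=gpu:A100:4,mps:400

Jobs then request a share of a GPU with something like --gres=mps:50
instead of --gres=gpu:4, so several jobs can share the same physical GPU.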
Kind regards,
Matt
On 2024-01-18 12:53, Kherfani, Hafedh (Professional Services, TC) wrote:
Hello Experts,
I'm a new Slurm user (so please bear with me :) ...).
Recently we've deployed Slurm version 23.11 on a very simple cluster,
which consists of a Master node (acting as a Login & Slurmdbd node as
well), a Compute node with an NVIDIA HGX A100-SXM4-40GB GPU, detected as
4 GPUs (GPU [0-3]), and a Storage Array presenting/sharing the NFS disk
(where users' home directories will be created as well).
The problem is that I've never been able to run a simple/dummy batch
script in parallel across the 4 GPUs. In fact, running the same command
"sbatch gpu-job.sh" multiple times shows that only a single job is
running, while the other jobs are in a pending state:
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 214
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 215
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 216
[slurmtest@c-a100-master test-batch-scripts]$ squeue
JOBID PARTITION     NAME      USER ST  TIME NODES NODELIST(REASON)
  216       gpu  gpu-job slurmtest PD  0:00     1 (None)
  215       gpu  gpu-job slurmtest PD  0:00     1 (Priority)
  214       gpu  gpu-job slurmtest PD  0:00     1 (Priority)
  213       gpu  gpu-job slurmtest PD  0:00     1 (Priority)
  212       gpu  gpu-job slurmtest PD  0:00     1 (Resources)
  211       gpu  gpu-job slurmtest  R  0:14     1 c-a100-cn01
PS: CPU jobs (i.e. using the default debug partition, without requesting
the GPU GRES) can run in parallel. The issue with running parallel jobs
is only seen when using the GPUs as a GRES.
I've tried many combinations of settings in gres.conf and slurm.conf;
many (if not most) of these combinations resulted in error messages in
the slurmctld and slurmd logs.
The current gres.conf and slurm.conf contents are shown below. This
configuration doesn't produce errors when restarting the slurmctld and
slurmd services (on the master and compute nodes, respectively), but as
I said, it doesn't allow jobs to be executed in parallel. The batch
script contents are shared below as well, to give more clarity on what
I'm trying to do:
[root@c-a100-master slurm]# cat gres.conf | grep -v "^#"
NodeName=c-a100-cn01 AutoDetect=nvml Name=gpu Type=A100 File=/dev/nvidia[0-3]
[root@c-a100-master slurm]# cat slurm.conf | grep -v "^#" | egrep -i "AccountingStorageTRES|GresTypes|NodeName|partition"
GresTypes=gpu
AccountingStorageTRES=gres/gpu
NodeName=c-a100-cn01 Gres=gpu:A100:4 CPUs=64 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=515181 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
PartitionName=gpu Nodes=ALL MaxTime=10:0:0
[slurmtest@c-a100-master test-batch-scripts]$ cat gpu-job.sh
#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --tasks-per-node=1
#SBATCH --output=gpu_job_output.%j   # Output file name (%j is replaced with the job ID)
#SBATCH --error=gpu_job_error.%j     # Error file name (%j is replaced with the job ID)
hostname
date
sleep 40
pwd
Any help on which changes need to be made to the config files (mainly
slurm.conf and gres.conf) and/or the batch script, so that multiple jobs
can be in a "Running" state at the same time (in parallel), would be much
appreciated.
Thanks in advance for your help!
Best regards,
Hafedh Kherfani