+1 on checking the memory allocation. Also add/check whether you have any DefMemPerCPU / DefMemPerGPU / DefMemPerNode set in your slurm.conf.
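For example (an untested sketch, and the values below are only placeholders to adapt to your nodes), either set a sensible default in slurm.conf so that a job without an explicit memory request no longer gets the whole node's memory:

# slurm.conf (DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually exclusive; pick one)
DefMemPerCPU=4096         # MB per allocated CPU
#DefMemPerGPU=65536       # or: MB per allocated GPU

or cap the memory explicitly in each batch script:

#SBATCH --mem=16G         # placeholder value; small enough for several jobs to share the node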
On Fri, Jan 19, 2024 at 12:33 AM mohammed shambakey <shambak...@gmail.com> wrote:

> Hi
>
> I'm not an expert, but is it possible that the currently running job is
> consuming the whole node because it is allocated the whole memory of the
> node (so the other 2 jobs had to wait until it finishes)?
> Maybe try to restrict the required memory for each job?
>
> Regards
>
> On Thu, Jan 18, 2024 at 4:46 PM Ümit Seren <uemit.se...@gmail.com> wrote:
>
>> This line also has to be changed:
>>
>> #SBATCH --gpus-per-node=4   ->   #SBATCH --gpus-per-node=1
>>
>> --gpus-per-node seems to be the new parameter that is replacing the
>> --gres= one, so you can remove the --gres line completely.
>>
>> Best
>>
>> Ümit
>>
>> *From: *slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
>> Kherfani, Hafedh (Professional Services, TC) <hafedh.kherf...@hpe.com>
>> *Date: *Thursday, 18. January 2024 at 15:40
>> *To: *Slurm User Community List <slurm-users@lists.schedmd.com>
>> *Subject: *Re: [slurm-users] Need help with running multiple
>> instances/executions of a batch script in parallel (with NVIDIA HGX A100
>> GPU as a Gres)
>>
>> Hi Noam and Matthias,
>>
>> Thanks both for your answers.
>>
>> I changed the “#SBATCH --gres=gpu:4” directive (in the batch script) to
>> “#SBATCH --gres=gpu:1” as you suggested, but it didn’t make a difference:
>> running this batch script 3 times still results in the first job running,
>> while the second and third jobs remain pending ...
>>
>> [slurmtest@c-a100-master test-batch-scripts]$ cat gpu-job.sh
>> #!/bin/bash
>> #SBATCH --job-name=gpu-job
>> #SBATCH --partition=gpu
>> #SBATCH --nodes=1
>> #SBATCH --gpus-per-node=4
>> #SBATCH --gres=gpu:1              # <<<< Changed from ‘4’ to ‘1’
>> #SBATCH --tasks-per-node=1
>> #SBATCH --output=gpu_job_output.%j
>> #SBATCH --error=gpu_job_error.%j
>>
>> hostname
>> date
>> sleep 40
>> pwd
>>
>> [slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
>> Submitted batch job 217
>> [slurmtest@c-a100-master test-batch-scripts]$ squeue
>>   JOBID PARTITION     NAME      USER ST   TIME  NODES NODELIST(REASON)
>>     217       gpu  gpu-job  slurmtes  R   0:02      1 c-a100-cn01
>> [slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
>> Submitted batch job 218
>> [slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
>> Submitted batch job 219
>> [slurmtest@c-a100-master test-batch-scripts]$ squeue
>>   JOBID PARTITION     NAME      USER ST   TIME  NODES NODELIST(REASON)
>>     219       gpu  gpu-job  slurmtes PD   0:00      1 (Priority)
>>     218       gpu  gpu-job  slurmtes PD   0:00      1 (Resources)
>>     217       gpu  gpu-job  slurmtes  R   0:07      1 c-a100-cn01
>>
>> Basically I’m seeking some help/hints on how to tell Slurm, from the
>> batch script for example: “I want only 1 or 2 GPUs to be used/consumed by
>> the job”, and then run the batch script/job a couple of times with sbatch
>> and confirm that we can indeed have multiple jobs each using a GPU and
>> running in parallel, at the same time.
>>
>> Makes sense?
>>
>> Best regards,
>>
>> *Hafedh*
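For reference, a minimal single-GPU variant of the script above might look like the following (untested sketch; the --mem value is only a placeholder, and keep either --gres=gpu:1 or --gpus-per-node=1, not both, since per Ümit's note they cover the same request):

#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1               # one GPU per job (or: --gpus-per-node=1)
#SBATCH --tasks-per-node=1
#SBATCH --mem=16G                  # placeholder; don't claim the node's entire memory
#SBATCH --output=gpu_job_output.%j
#SBATCH --error=gpu_job_error.%j

hostname
date
sleep 40
pwd

Submitting this three or four times with sbatch should then show several jobs in state R at the same time on the 4-GPU node, as long as CPUs and memory are not the limiting resources.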
>> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> *On Behalf
>> Of *Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
>> *Sent:* Thursday, 18 January 2024 2:30 PM
>> *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
>> *Subject:* Re: [slurm-users] Need help with running multiple
>> instances/executions of a batch script in parallel (with NVIDIA HGX A100
>> GPU as a Gres)
>>
>> On Jan 18, 2024, at 7:31 AM, Matthias Loose <m.lo...@mindcode.de> wrote:
>>
>> Hi Hafedh,
>>
>> I'm no expert on the GPU side of Slurm, but looking at your current
>> configuration, to me it is working as intended at the moment. You have
>> defined 4 GPUs and start multiple jobs, each consuming 4 GPUs. So the
>> jobs wait for the resources to be free again.
>>
>> I think what you need to look into is the MPS plugin, which seems to do
>> what you are trying to achieve:
>> https://slurm.schedmd.com/gres.html#MPS_Management
>>
>> I agree with the first paragraph. How many GPUs are you expecting each
>> job to use? I'd have assumed, based on the original text, that each job
>> is supposed to use 1 GPU, and the 4 jobs were supposed to be running
>> side-by-side on the one node you have (with 4 GPUs). If so, you need to
>> tell each job to request only 1 GPU, and currently each one is
>> requesting 4.
>>
>> If your jobs are actually supposed to be using 4 GPUs each, I still
>> don't see any advantage to MPS (at least in what is my usual GPU usage
>> pattern): all the jobs will take longer to finish, because they are
>> sharing the fixed resource. If they take turns, at least the first ones
>> finish as fast as they can, and the last one will finish no later than
>> it would have if they were all time-sharing the GPUs. I guess NVIDIA had
>> something in mind when they developed MPS, so I guess our pattern may
>> not be typical (or at least not universal), and in that case the MPS
>> plugin may well be what you need.
>
> --
> Mohammed
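In case sharing individual GPUs between several jobs via MPS does turn out to be the goal, the gres.html page linked above describes the setup. Very roughly, and with purely illustrative counts, device paths and node name (untested):

# slurm.conf
GresTypes=gpu,mps
NodeName=c-a100-cn01 Gres=gpu:4,mps:400   # MPS shares are spread evenly over the GPUs

# gres.conf on the GPU node
Name=gpu File=/dev/nvidia[0-3]
Name=mps Count=400

# in the job script: request a fraction of a GPU instead of a whole one
#SBATCH --gres=mps:50

As Noam points out, whether MPS actually helps depends on the workload; for plain one-job-per-GPU scheduling the single-GPU request shown earlier is enough.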