Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-18 Thread Kherfani, Hafedh (Professional Services, TC)
Hafedh (Professional Services, TC) Sent: Thursday, January 18, 2024 9:38 AM To: Slurm User Community List <slurm-users@lists.schedmd.com> Subject: Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-18 Thread Kherfani, Hafedh (Professional Services, TC)
Hi Noam and Matthias, Thank you both for your answers. I replaced the "#SBATCH --gres=gpu:4" directive (in the batch script) with "#SBATCH --gres=gpu:1" as you suggested, but it didn't make a difference: running this batch script 3 times results in the first job being in a running state, wh

[slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-18 Thread Kherfani, Hafedh (Professional Services, TC)
Hello Experts, I'm a new Slurm user (so please bear with me :) ...). We recently deployed Slurm version 23.11 on a very simple cluster, which consists of a Master node (acting as a Login & Slurmdbd node as well), a Compute Node which has an NVIDIA HGX A100-SXM4-40GB GPU, detected as 4 x GPUs