[slurm-users] Running slurm job on requested nvidia mig device

2024-01-18 Thread Dražen Jalšovec
Hi, We are testing the MIG deployment on our new slurm compute node with 4 x H100 GPUs. It looks like everything is configured correctly but we have a problem accessing mig devices. When I submit jobs requesting a mig gpu device #SBATCH --gres=gpu:H100_1g.10gb:1, the jobs get submitted to the node,

[slurm-users] Potential Side Effects of larger MessageTimeout value

2024-01-18 Thread Herc Silverstein
Hi, What are potential bad side effects of using a large/larger MessageTimeout? And is there a value at which this setting is too large (long)? Thanks, Herc

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-18 Thread Kherfani, Hafedh (Professional Services, TC)
Hi Ümit, Troy, I removed the line “#SBATCH --gres=gpu:1”, and changed the sbatch directive “--gpus-per-node=4” to “--gpus-per-node=1”, but still getting the same result: When running multiple sbatch commands for the same script, only one job (first execution) is running, and all subsequent jobs

Re: [slurm-users] error

2024-01-18 Thread Ole Holm Nielsen
On 1/18/24 17:42, Felix wrote: I started a new AMD node, and the error is as follows: "CPU frequency setting not configured for this node" extended looks like this: [2024-01-18T18:28:06.682] CPU frequency setting not configured for this node [2024-01-18T18:28:06.691] slurmd started on Thu, 18

[slurm-users] error

2024-01-18 Thread Felix
Hello I started a new AMD node, and the error is as follows: "CPU frequency setting not configured for this node" extended looks like this: [2024-01-18T18:28:06.682] CPU frequency setting not configured for this node [2024-01-18T18:28:06.691] slurmd started on Thu, 18 Jan 2024 18:28:06 +0200 [

Re: [slurm-users] [BULK] slurm-users Digest, Vol 75, Issue 26

2024-01-18 Thread Jason Macklin
when they developed MPS, so I guess our pattern may not be typical (or at least not universal), and in that case the MPS plugin may well be what you need.

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-18 Thread Baer, Troy
Hi Hafedh, Your job script has the sbatch directive “--gpus-per-node=4” set. I suspect that if you look at what’s allocated to the running job by doing “scontrol show job ” and looking at the TRES field, it’s been allocated 4 GPUs instead of one. Regards, --Troy From: slurm-us

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-18 Thread Ümit Seren
This line also has to be changed: #SBATCH --gpus-per-node=4 -> #SBATCH --gpus-per-node=1. --gpus-per-node seems to be the new parameter that is replacing the --gres= one, so you can remove the --gres line completely. Best Ümit From: slurm-users on behalf of Kherfani, Hafedh (Professional Servi

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-18 Thread Kherfani, Hafedh (Professional Services, TC)
Hi Noam and Matthias, Thanks both for your answers. I changed the "#SBATCH --gres=gpu:4" directive (in the batch script) with "#SBATCH --gres=gpu:1" as you suggested, but it didn't make a difference, as running this batch script 3 times will result in the first job to be in a running state, wh

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-18 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
On Jan 18, 2024, at 7:31 AM, Matthias Loose wrote: Hi Hafedh, I'm no expert in the GPU side of SLURM, but looking at your current configuration, to me it's working as intended at the moment. You have defined 4 GPUs and start multiple jobs, each consuming 4 GPUs. So the jobs wait for the resou

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-18 Thread Matthias Loose
Hi Hafedh, I'm no expert in the GPU side of SLURM, but looking at your current configuration, to me it's working as intended at the moment. You have defined 4 GPUs and start multiple jobs, each consuming 4 GPUs. So the jobs wait for the resource to be free again. I think what you need to l

[slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-18 Thread Kherfani, Hafedh (Professional Services, TC)
Hello Experts, I'm a new Slurm user (so please bear with me :) ...). Recently we've deployed Slurm version 23.11 on a very simple cluster, which consists of a Master node (acting as a Login & Slurmdbd node as well), a Compute Node which has an NVIDIA HGX A100-SXM4-40GB GPU, detected as 4 x GPU's

Re: [slurm-users] slurm.conf

2024-01-18 Thread Cutts, Tim
Can you not also do this with a single configuration file but configuring multiple clusters which the user can choose with the -M option? I suppose it depends on the use case; if you want to be able to choose a dev cluster over the production one, to test new config options, then the environmen

Re: [slurm-users] slurm.conf

2024-01-18 Thread Hermann Schwärzler
Hi Christine, yes, you can either set the environment variable SLURM_CONF to the full path of the configuration-file you want to use and then run any program. Or you can do it like this SLURM_CONF=/your/path/to/slurm.conf sinfo|sbatch|srun|... But I am not quite sure if this is really the be

Re: [slurm-users] slurm.conf

2024-01-18 Thread Bjørn-Helge Mevik
LEROY Christine 208562 writes: > Is there an env variable in SLURM to tell where the slurm.conf is? > We would like to have on the same client node, 2 type of possible submissions > to address 2 different cluster. According to man sbatch: SLURM_CONF The location of the Slurm

[slurm-users] slurm.conf

2024-01-18 Thread LEROY Christine 208562
Hello all, Is there an env variable in SLURM to tell where the slurm.conf is? We would like to have on the same client node, 2 type of possible submissions to address 2 different cluster. Thanks in advance, Christine

Re: [slurm-users] What happens if GPU GRES exceeding number of GPUs per node

2024-01-18 Thread Juergen Salk
Hi Wirawan, in general `--gres=gpu:6` actually means six units of a generic resource named `gpu` per node. Each unit may or may not be associated with a physical GPU device. I'd check the node configuration for the number of gres=gpu resource units that are configured for that node. scont