Hi Benjamin,
Yes, I saw this in an archived discussion too and I have added these
parameters. It was a little tricky to do, as my setup is deployed via
Ansible. But with this setup I am not able to request a GPU at all. All
of these tests fail and Slurm does not accept the job:
srun -n 1 -p tenibre-gpu --gres=gpu:A100-40 ./a.out
srun -n 1 -p tenibre-gpu --gres=gpu:A100-40:1 ./a.out
srun -n 1 -p tenibre-gpu --gpus-per-node=A100-40:1 ./a.out
srun -n 1 -p tenibre-gpu --gpus-per-node=1 ./a.out
srun -n 1 -p tenibre-gpu --gres=gpu:1 ./a.out
Maybe there is some restriction on the GPU type field because of the
"minus" sign? No idea. But launching GPU code without reserving a GPU
now fails at execution time on the node, so a first step is done!
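(As a sanity check, and only as a sketch assuming the node name from my
configuration, it may be worth comparing what the controller actually
registered for the node against what is being requested:
scontrol show node tenibre-gpu-0 | grep -i gres
sinfo -p tenibre-gpu -o "%N %G"
If the type string A100-40 does not show up there, a request such as
--gres=gpu:A100-40:1 would be refused regardless of the "minus" sign.)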
Maybe I should upgrade my Slurm version from 20.11 to the latest. But I
had to put the cluster back into production without the GPU setup this
evening.
Patrick
On 13/11/2024 at 17:31, Benjamin Smith via slurm-users wrote:
Hi Patrick,
You're missing a Gres= on your node in your slurm.conf:
Nodename=tenibre-gpu-0 RealMemory=257270 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:A100-40:1,gpu:A100-80:1
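To illustrate (just a sketch, reusing the gres.conf lines from your
message below), the node declaration in slurm.conf and the device
mapping in gres.conf then need to agree on the GRES counts:
# slurm.conf
NodeName=tenibre-gpu-0 ... Gres=gpu:A100-40:1,gpu:A100-80:1  # other node parameters as before
# gres.conf on tenibre-gpu-0
NodeName=tenibre-gpu-0 Name=gpu Type=A100-40 File=/dev/nvidia0
NodeName=tenibre-gpu-0 Name=gpu Type=A100-80 File=/dev/nvidia1
After changing the node definition, restarting slurmctld and the slurmd
on the node is the safest way to make sure the new GRES are picked up.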
Ben
On 13/11/2024 16:00, Patrick Begou via slurm-users wrote:
On 13/11/2024 at 15:45, Roberto Polverelli Monti via slurm-users wrote:
Hello Patrick,
On 11/13/24 12:01 PM, Patrick Begou via slurm-users wrote:
As usage of this GPU resource increases, I would like to manage it
with GRES to avoid usage conflicts. But at this time my setup does not
work, as I can reach a GPU without reserving it:
srun -n 1 -p tenibre-gpu ./a.out
can use a GPU even if the reservation does not specify this resource
(checked by running nvidia-smi on the node). "tenibre-gpu" is a Slurm
partition containing only this GPU node.
I think what you're looking for is the ConstrainDevices parameter in
cgroup.conf.
See here:
- https://slurm.schedmd.com/archive/slurm-20.11.7/cgroup.conf.html
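A minimal cgroup.conf enabling device constraint could look something
like this (only a sketch; keep whichever other Constrain* settings
match your site policy):
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
With ConstrainDevices=yes, task/cgroup only grants a job access to the
GPU device files it has actually been allocated.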
Best,
Hi Roberto,
thanks for pointing me to this parameter. I set it, updated all the
nodes and restarted slurmd everywhere, but it does not change the
behavior. However, looking at the slurmd log on the GPU node I noticed
this information:
[2024-11-13T16:41:08.434] debug: CPUs:32 Boards:1 Sockets:8 CoresPerSocket:4 ThreadsPerCore:1
[2024-11-13T16:41:08.434] debug: gres/gpu: init: loaded
[2024-11-13T16:41:08.434] WARNING: A line in gres.conf for GRES gpu:A100-40 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
[2024-11-13T16:41:08.434] WARNING: A line in gres.conf for GRES gpu:A100-80 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
[2024-11-13T16:41:08.434] debug: gpu/generic: init: init: GPU Generic plugin loaded
[2024-11-13T16:41:08.434] topology/none: init: topology NONE plugin loaded
[2024-11-13T16:41:08.434] route/default: init: route default plugin loaded
[2024-11-13T16:41:08.434] CPU frequency setting not configured for this node
[2024-11-13T16:41:08.434] debug: Resource spec: No specialized cores configured by default on this node
[2024-11-13T16:41:08.434] debug: Resource spec: Reserved system memory limit not configured for this node
[2024-11-13T16:41:08.434] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2024-11-13T16:41:08.434] error: MaxSwapPercent value (0.0%) is not a valid number
[2024-11-13T16:41:08.436] debug: task/cgroup: init: core enforcement enabled
[2024-11-13T16:41:08.437] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257281M allowed:100%(enforced), swap:0%(enforced), max:100%(257281M) max+swap:100%(514562M) min:30M kmem:100%(257281M permissive) min:30M swappiness:0(unset)
[2024-11-13T16:41:08.437] debug: task/cgroup: init: memory enforcement enabled
[2024-11-13T16:41:08.438] debug: task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2024-11-13T16:41:08.438] debug: task/cgroup: init: device enforcement enabled
[2024-11-13T16:41:08.438] debug: task/cgroup: init: task/cgroup: loaded
[2024-11-13T16:41:08.438] debug: auth/munge: init: Munge authentication plugin loaded
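(A note on the task_cgroup_devices_init line above: on this Slurm
release, device constraint with cgroup/v1 uses an allowed-devices
whitelist. A sketch of /etc/slurm/cgroup_allowed_devices_file.conf,
based on the example in the Slurm cgroup documentation, would be:
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
Devices managed through GRES, such as /dev/nvidia0 and /dev/nvidia1,
are deliberately left out of this list so that jobs only see them when
they request the corresponding GRES.)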
So I think something is wrong in my gres.conf file, maybe because I try
to configure two different devices on the node?
## GPU setup on tenibre-gpu-0
NodeName=tenibre-gpu-0 Name=gpu Type=A100-40 File=/dev/nvidia0 Flags=nvidia_gpu_env
NodeName=tenibre-gpu-0 Name=gpu Type=A100-80 File=/dev/nvidia1 Flags=nvidia_gpu_env
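(One possible way to dig further, just as a sketch: run slurmd in the
foreground with extra verbosity,
slurmd -D -vvv
or set DebugFlags=Gres in slurm.conf, so that slurmctld and slurmd log
the GRES they build from slurm.conf plus gres.conf. The two WARNING
lines above already hint that gres.conf declares more GPUs than
slurm.conf expects for this node.)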
Patrick
--
Benjamin Smith <bsmi...@ed.ac.uk>
Computing Officer, AT-7.12a
Research and Teaching Unit
School of Informatics, University of Edinburgh