Hello Patrick,

Yeah, I'd recommend upgrading, and I imagine most others will, too. I have found with Slurm that upgrades are nearly mandatory, at least annually or so, mostly because upgrading from much older versions is more challenging and requires bootstrapping. Not sure about the minus sign; that's an interesting hypothesis. For what it's worth, we don't use minus signs in our names. You may want to avoid characters like that, or perhaps use an underscore instead.
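If the minus sign does turn out to be the culprit, one way to test the idea is to rename the type on both sides of the configuration and keep the counts in gres.conf and slurm.conf in agreement. The underscore names below (a100_40, a100_80) are hypothetical substitutes for the original A100-40/A100-80, just for illustration:

```
# gres.conf on the GPU node (hypothetical underscore type names)
NodeName=tenibre-gpu-0 Name=gpu Type=a100_40 File=/dev/nvidia0 Flags=nvidia_gpu_env
NodeName=tenibre-gpu-0 Name=gpu Type=a100_80 File=/dev/nvidia1 Flags=nvidia_gpu_env

# matching node definition in slurm.conf (Gres counts must match gres.conf)
NodeName=tenibre-gpu-0 RealMemory=257270 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 Gres=gpu:a100_40:1,gpu:a100_80:1 State=UNKNOWN

# then request a GPU by the new type name
srun -n 1 -p tenibre-gpu --gres=gpu:a100_40:1 ./a.out
```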
Jason

On Wed, Nov 13, 2024 at 4:11 PM Patrick Begou via slurm-users <slurm-users@lists.schedmd.com> wrote:

> Hi Benjamin,
>
> Yes, I saw this in an archived discussion too and I've added these parameters. A little bit tricky to do, as my setup is deployed via Ansible. But with this setup I'm not able to request a GPU at all. All these tests fail and Slurm does not accept the job:
>
> srun -n 1 -p tenibre-gpu --gres=gpu:A100-40 ./a.out
> srun -n 1 -p tenibre-gpu --gres=gpu:A100-40:1 ./a.out
> srun -n 1 -p tenibre-gpu --gpus-per-node=A100-40:1 ./a.out
> srun -n 1 -p tenibre-gpu --gpus-per-node=1 ./a.out
> srun -n 1 -p tenibre-gpu --gres=gpu:1 ./a.out
>
> Maybe there is some restriction on the GPU type field with the "minus" sign? No idea. But launching a GPU code without reserving a GPU now fails at execution time on the node, so a first step is done!
>
> Maybe I should upgrade my Slurm version from 20.11 to the latest, but I had to put the cluster back into production without the GPU setup this evening.
>
> Patrick
>
> On 13/11/2024 at 17:31, Benjamin Smith via slurm-users wrote:
>
> Hi Patrick,
>
> You're missing a Gres= on your node in your slurm.conf:
>
> NodeName=tenibre-gpu-0 RealMemory=257270 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:A100-40:1,gpu:A100-80:1
>
> Ben
>
> On 13/11/2024 at 16:00, Patrick Begou via slurm-users wrote:
>
> On 13/11/2024 at 15:45, Roberto Polverelli Monti via slurm-users wrote:
>
> Hello Patrick,
>
> On 11/13/24 12:01 PM, Patrick Begou via slurm-users wrote:
>
> As usage of this GPU resource increases, I would like to manage it with GRES to avoid usage conflicts.
> But at this time my setup does not work, as I can reach a GPU without reserving it:
>
> srun -n 1 -p tenibre-gpu ./a.out
>
> can use a GPU even if the reservation does not specify this resource (checked by running nvidia-smi on the node). "tenibre-gpu" is a Slurm partition with only this GPU node.
>
> I think what you're looking for is the ConstrainDevices parameter in cgroup.conf.
>
> See here:
> - https://slurm.schedmd.com/archive/slurm-20.11.7/cgroup.conf.html
>
> Best,
>
> Hi Roberto,
>
> thanks for pointing to this parameter. I set it, updated all the nodes, and restarted slurmd everywhere, but it does not change the behavior. However, when looking at the slurmd log on the GPU node I noticed this information:
>
> [2024-11-13T16:41:08.434] debug: CPUs:32 Boards:1 Sockets:8 CoresPerSocket:4 ThreadsPerCore:1
> [2024-11-13T16:41:08.434] debug: gres/gpu: init: loaded
> [2024-11-13T16:41:08.434] WARNING: A line in gres.conf for GRES gpu:A100-40 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
> [2024-11-13T16:41:08.434] WARNING: A line in gres.conf for GRES gpu:A100-80 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
> [2024-11-13T16:41:08.434] debug: gpu/generic: init: init: GPU Generic plugin loaded
> [2024-11-13T16:41:08.434] topology/none: init: topology NONE plugin loaded
> [2024-11-13T16:41:08.434] route/default: init: route default plugin loaded
> [2024-11-13T16:41:08.434] CPU frequency setting not configured for this node
> [2024-11-13T16:41:08.434] debug: Resource spec: No specialized cores configured by default on this node
> [2024-11-13T16:41:08.434] debug: Resource spec: Reserved system memory limit not configured for this node
> [2024-11-13T16:41:08.434] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
> [2024-11-13T16:41:08.434] error: MaxSwapPercent value (0.0%) is not a valid number
> [2024-11-13T16:41:08.436] debug: task/cgroup: init: core enforcement enabled
> [2024-11-13T16:41:08.437] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257281M allowed:100%(enforced), swap:0%(enforced), max:100%(257281M) max+swap:100%(514562M) min:30M kmem:100%(257281M permissive) min:30M swappiness:0(unset)
> [2024-11-13T16:41:08.437] debug: task/cgroup: init: memory enforcement enabled
> [2024-11-13T16:41:08.438] debug: task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
> [2024-11-13T16:41:08.438] debug: task/cgroup: init: device enforcement enabled
> [2024-11-13T16:41:08.438] debug: task/cgroup: init: task/cgroup: loaded
> [2024-11-13T16:41:08.438] debug: auth/munge: init: Munge authentication plugin loaded
>
> So I think something is wrong in my gres.conf file, maybe because I try to configure 2 different devices on the node?
> ## GPU setup on tenibre-gpu-0
> NodeName=tenibre-gpu-0 Name=gpu Type=A100-40 File=/dev/nvidia0 Flags=nvidia_gpu_env
> NodeName=tenibre-gpu-0 Name=gpu Type=A100-80 File=/dev/nvidia1 Flags=nvidia_gpu_env
>
> Patrick
>
> --
> Benjamin Smith <bsmi...@ed.ac.uk>
> Computing Officer, AT-7.12a
> Research and Teaching Unit
> School of Informatics, University of Edinburgh
>
> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. Is e buidheann carthannais a th' ann an Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

--
Jason L. Simms, Ph.D., M.P.H.
Research Computing Manager
Swarthmore College
Information Technology Services
(610) 328-8102
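For readers following the device-isolation part of this thread: the ConstrainDevices parameter Roberto linked lives in cgroup.conf on each node. A minimal sketch for a 20.11-era setup might look like the following; the values are illustrative, not a drop-in file, and the MaxSwapPercent line is only a guess at why the log above reports "(0.0%) is not a valid number" (a plain integer percentage may parse where 0.0 did not):

```
# cgroup.conf sketch (Slurm 20.11 era; illustrative values)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
# the key setting: restrict jobs to only the devices they requested
ConstrainDevices=yes
# hypothetical fix for the "(0.0%)" parse error seen in the slurmd log
MaxSwapPercent=0
```

With device enforcement active (and the GRES counts in slurm.conf and gres.conf in agreement), running `srun -n 1 -p tenibre-gpu nvidia-smi` without any --gres option would be expected to see no GPUs, while a job requesting `--gres=gpu:1` should see exactly one.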