On 13/11/2024 at 15:45, Roberto Polverelli Monti via slurm-users wrote:
Hello Patrick,
On 11/13/24 12:01 PM, Patrick Begou via slurm-users wrote:
As usage of this GPU resource increases, I would like to manage it
with GRES to avoid usage conflicts. But at this time my setup
does not work, as I can reach a GPU without reserving one:
srun -n 1 -p tenibre-gpu ./a.out
can use a GPU even though the reservation does not specify this
resource (checked by running nvidia-smi on the node). "tenibre-gpu" is
a Slurm partition containing only this GPU node.
I think what you're looking for is the ConstrainDevices parameter in
cgroup.conf.
See here:
- https://slurm.schedmd.com/archive/slurm-20.11.7/cgroup.conf.html
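For reference, a minimal cgroup.conf enabling device constraint might look like the sketch below; the exact set of parameters depends on the Slurm version and site defaults, so treat it as a starting point rather than a complete configuration:

```
# /etc/slurm/cgroup.conf -- minimal sketch, adjust to your site
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes   # confine each job to the GRES devices it requested
```

After changing cgroup.conf, slurmd must be restarted on the compute nodes for the new constraints to take effect.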
Best,
Hi Roberto,
thanks for pointing to this parameter. I set it, updated all the nodes,
and restarted slurmd everywhere, but it does not change the behavior.
However, when looking at the slurmd log on the GPU node I noticed this
information:
[2024-11-13T16:41:08.434] debug: CPUs:32 Boards:1 Sockets:8
CoresPerSocket:4 ThreadsPerCore:1
[2024-11-13T16:41:08.434] debug: gres/gpu: init: loaded
[2024-11-13T16:41:08.434] WARNING: A line in gres.conf for GRES
gpu:A100-40 has 1 more configured than expected in slurm.conf. Ignoring
extra GRES.
[2024-11-13T16:41:08.434] WARNING: A line in gres.conf for GRES
gpu:A100-80 has 1 more configured than expected in slurm.conf. Ignoring
extra GRES.
[2024-11-13T16:41:08.434] debug: gpu/generic: init: init: GPU Generic
plugin loaded
[2024-11-13T16:41:08.434] topology/none: init: topology NONE plugin loaded
[2024-11-13T16:41:08.434] route/default: init: route default plugin loaded
[2024-11-13T16:41:08.434] CPU frequency setting not configured for this node
[2024-11-13T16:41:08.434] debug: Resource spec: No specialized cores
configured by default on this node
[2024-11-13T16:41:08.434] debug: Resource spec: Reserved system memory
limit not configured for this node
[2024-11-13T16:41:08.434] debug: Reading cgroup.conf file
/etc/slurm/cgroup.conf
[2024-11-13T16:41:08.434] error: MaxSwapPercent value (0.0%) is not a
valid number
[2024-11-13T16:41:08.436] debug: task/cgroup: init: core enforcement enabled
[2024-11-13T16:41:08.437] debug: task/cgroup: task_cgroup_memory_init:
task/cgroup/memory: total:257281M allowed:100%(enforced),
swap:0%(enforced), max:100%(257281M) max+swap:100%(514562M) min:30M
kmem:100%(257281M permissive) min:30M swappiness:0(unset)
[2024-11-13T16:41:08.437] debug: task/cgroup: init: memory enforcement
enabled
[2024-11-13T16:41:08.438] debug: task/cgroup: task_cgroup_devices_init:
unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file
or directory
[2024-11-13T16:41:08.438] debug: task/cgroup: init: device enforcement
enabled
[2024-11-13T16:41:08.438] debug: task/cgroup: init: task/cgroup: loaded
[2024-11-13T16:41:08.438] debug: auth/munge: init: Munge authentication
plugin loaded
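As an aside, the "unable to open /etc/slurm/cgroup_allowed_devices_file.conf" debug line above suggests the allowed-devices whitelist is missing on this node. A sketch of such a file, loosely based on the example shipped with Slurm (the device paths are site-dependent, so this is only a guess at a sensible default):

```
# /etc/slurm/cgroup_allowed_devices_file.conf -- sketch
# Devices every job may access regardless of its GRES request
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
```

Note that /dev/nvidia* is deliberately absent: with ConstrainDevices=yes, access to the GPU device files should be granted only through the job's GRES allocation.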
So I think something is wrong in my gres.conf file, maybe because I try
to configure two different devices on the node?
## GPU setup on tenibre-gpu-0
NodeName=tenibre-gpu-0 Name=gpu Type=A100-40 File=/dev/nvidia0 Flags=nvidia_gpu_env
NodeName=tenibre-gpu-0 Name=gpu Type=A100-80 File=/dev/nvidia1 Flags=nvidia_gpu_env
Patrick
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com