Hello Cristobal,

Cristóbal Navarro <cristobal.navarr...@gmail.com> writes:

> Hello Angel and Community,
> I am facing a similar problem with a DGX A100 with DGX OS 6 (Based on
> Ubuntu 22.04 LTS) and Slurm 23.02.
> When I execute `slurmd` service, it status shows failed with the
> following information below.
> As of today, what is the best solution to this problem? I am really
> not sure if the DGX A100 could fail or not by disabling cgroups v1.
> Any suggestions are welcome.

did you manage to find a solution to this without disabling cgroups v1?

In our case:

,----
| slurm 23.02.3
| Ubuntu 22.04.3 LTS
|
| # cat /proc/cmdline
| BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=... ro quiet splash cgroup_no_v1=all vt.handoff=7
`----

disabling cgroups v1 has been working reliably, but it would be nice to
find a solution that doesn't require modifying the kernel parameters.

Cheers,
-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52