Hello Cristobal,

Cristóbal Navarro <cristobal.navarr...@gmail.com> writes:

> Hello Angel and Community,

> I am facing a similar problem with a DGX A100 with DGX OS 6 (Based on
> Ubuntu 22.04 LTS) and Slurm 23.02.
> When I start the `slurmd` service, its status shows failed with the
> information below.
> As of today, what is the best solution to this problem? I am really
> not sure if the DGX A100 could fail or not by disabling cgroups v1.
> Any suggestions are welcome.

Did you manage to find a solution to this without disabling cgroups v1?

In our case:

,----
| slurm 23.02.3
| Ubuntu 22.04.3 LTS
| 
| # cat /proc/cmdline
| BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=... ro quiet splash cgroup_no_v1=all vt.handoff=7
`----

disabling cgroups v1 has been working reliably, but it would be nice to
find a solution that doesn't require modifying the kernel parameters.
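
In case it helps to compare nodes, here is a minimal sketch for checking
whether the parameter is active and which cgroup hierarchy is mounted. The
function name `cgroup_no_v1_value` is only illustrative (it is not part of
Slurm or any standard tool), and I am assuming a GNU/Linux system with
`stat` from coreutils:

```shell
#!/bin/sh
# Hypothetical helper: print the value of cgroup_no_v1 found in a kernel
# command line string, or print nothing if the parameter is absent.
cgroup_no_v1_value() {
    for arg in $1; do
        case "$arg" in
            cgroup_no_v1=*) printf '%s\n' "${arg#cgroup_no_v1=}" ;;
        esac
    done
}

# Inspect the running kernel's command line, if it is readable.
if [ -r /proc/cmdline ]; then
    cgroup_no_v1_value "$(cat /proc/cmdline)"
fi

# Filesystem type mounted at /sys/fs/cgroup: "cgroup2fs" means the unified
# (v2) hierarchy; "tmpfs" usually indicates a v1 or hybrid layout.
if [ -d /sys/fs/cgroup ]; then
    stat -fc %T /sys/fs/cgroup || echo "unknown"
fi
```

With the command line shown above, the first check should print the value
given to `cgroup_no_v1`.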

Cheers,
-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52
