Hello Angel and community, I am facing a similar problem on a DGX A100 running DGX OS 6 (based on Ubuntu 22.04 LTS) and Slurm 23.02. When I start the `slurmd` service, it fails with the status output below. As of today, what is the best solution to this problem? I am not sure whether disabling cgroups v1 could break anything on the DGX A100. Any suggestions are welcome.
```
➜ slurm-23.02.3 systemctl status slurmd.service
× slurmd.service - Slurm node daemon
     Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2023-07-24 19:07:03 -04; 7s ago
    Process: 3680019 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 3680019 (code=exited, status=1/FAILURE)
        CPU: 40ms

jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug:  Log file re-opened
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_init
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_load
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_export_xml
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug:  CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=8:2(hw) CoresPerSocket=16:64(hw)
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 2:freezer:/
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: 0::/init.scope
jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Failed with result 'exit-code'.
➜ slurm-23.02.3
```

On Wed, May 3, 2023 at 6:32 PM Angel de Vicente <angel.de.vice...@iac.es> wrote:

> Hello,
>
> Angel de Vicente <angel.de.vice...@iac.es> writes:
>
> > ,----
> > | slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are:
> > | 5:freezer:/
> > | 3:cpuacct:/
> > `----
>
> In the end I learnt that, despite Ubuntu 22.04 reporting to be using
> only cgroup V2, it was also using V1 and creating those mount points,
> and Slurm 23.02.01 was then complaining that it could not work with
> cgroups in hybrid mode.
>
> So the "solution" (as long as you don't need V1 for some reason) was to
> add "cgroup_no_v1=all" to the kernel parameters and reboot: no more V1
> mount points, and Slurm was happy with that.
>
> [In case somebody is interested in the future: I needed this so that I
> could limit the resources given to users not using Slurm. We have some
> shared workstations with many cores, and users were oversubscribing the
> CPUs, so I installed Slurm to bring some order to the executions there.
> But these machines are not an actual cluster with a login node: the
> login node is the same as the executing node! So with cgroups I make
> sure that users connecting via ssh only get resources equivalent to
> 3/4 of a core (enough to edit files, etc.) until they submit their jobs
> via Slurm, at which point they get the full allocation they requested.]
>
> Cheers,
> --
> Ángel de Vicente
> Research Software Engineer (Supercomputing and BigData)
> Tel.: +34 922-605-747
> Web.: http://research.iac.es/proyecto/polmag/
>
> GPG: 0x8BDC390B69033F52

--
Cristóbal A. Navarro
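
For anyone hitting the same fatal error: "hybrid mode" means the node has both cgroup v1 controller mounts and the unified v2 hierarchy active at the same time, which Slurm 23.02's cgroup plugins refuse to handle. You can confirm the state of a node before touching kernel parameters; a minimal check using standard `mount` and `stat` commands, nothing Slurm-specific:

```
# cgroup v1 controller mounts; any output here means v1 is (still) active
mount -t cgroup

# the unified cgroup v2 hierarchy, normally mounted at /sys/fs/cgroup
mount -t cgroup2

# prints "cgroup2fs" on a pure v2 system; "tmpfs" points to v1/hybrid
stat -fc %T /sys/fs/cgroup
```

If both of the first two commands show mounts (matching the `2:freezer:/` line in the log above), the node is in hybrid mode and slurmd will keep aborting.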
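
A minimal sketch of the kernel-parameter change Angel describes, assuming a stock GRUB setup as shipped with Ubuntu 22.04 / DGX OS 6 (editing /etc/default/grub by hand works just as well as the sed one-liner below):

```
# append cgroup_no_v1=all to the kernel command line
sudo sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 cgroup_no_v1=all"/' /etc/default/grub

# regenerate the boot configuration and reboot so the kernel picks it up
sudo update-grub
sudo reboot

# after the reboot this should print nothing: no v1 mounts, no hybrid mode
mount -t cgroup
```

Slurm 23.02 should then detect the unified hierarchy on its own; setting `CgroupPlugin=cgroup/v2` in cgroup.conf makes the choice explicit. Whether disabling v1 is safe on a DGX A100 depends on whether any other software on the node still relies on v1 controllers; recent container stacks (Docker 20.10+, NVIDIA Container Toolkit) run on cgroup v2, but that is worth verifying on your own node before rebooting it.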
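
Angel does not say how he implemented the "3/4 of a core" cap for ssh sessions. One way to do it on a cgroup v2 system is a systemd drop-in for the user slice template; the file name and the 75% figure below are illustrative assumptions, not his actual configuration:

```
# /etc/systemd/system/user-.slice.d/50-cpuquota.conf
# Applies to every user-<UID>.slice, i.e. to all interactive ssh logins.
# Slurm job steps run under the system slice (slurmd/slurmstepd), so they
# should not inherit this cap and still get their full allocation.
[Slice]
CPUQuota=75%
```

After `sudo systemctl daemon-reload`, new login sessions are limited in total to 75% of one CPU, while jobs launched through Slurm keep whatever resources they were allocated.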