On 26.07.23 at 11:38, Ralf Utermann wrote:
On 25.07.23 at 02:09, Cristóbal Navarro wrote:
Hello Angel and Community,
I am facing a similar problem with a DGX A100 with DGX OS 6 (Based on Ubuntu
22.04 LTS) and Slurm 23.02.
When I start the `slurmd` service, its status shows failed, with the
information below.
Hello Cristobal,
we see similar problems not on DGX but standard server nodes running
Ubuntu 22.04 (kernel 5.15.0-76-generic) and Slurm 23.02.3.
The first start of the slurmd service always fails, with lots of errors
in the slurmd.log like:
error: cpu cgroup controller is not available.
error: There's an issue initializing memory or cpu controller
After 90 seconds the slurmd service start times out and fails.
BUT: One process is still running:
/usr/local/slurm/23.02.3/sbin/slurmstepd infinity
This looks like the process started to handle cgroup v2 as described in
https://slurm.schedmd.com/cgroup_v2.html
When we keep this slurmstepd infinity running, and just start
the slurmd service a second time, everything comes up running.
So our current workaround is: we configure the slurmd service
with a Restart=on-failure in the [Service] section.
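For reference, that workaround as a systemd drop-in would look roughly like the following (the drop-in path and unit name are the usual ones for a hand-built Slurm; adjust to your install):

```ini
# /etc/systemd/system/slurmd.service.d/restart.conf
# Retry slurmd after the first (failing) start; by then the
# "slurmstepd infinity" process is already up and the second
# start succeeds.
[Service]
Restart=on-failure
RestartSec=5
```

After creating the file, run `systemctl daemon-reload` so systemd picks it up.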
Are there real solutions to this initial timeout failure?
best regards, Ralf
As of today, what is the best solution to this problem? I am not sure
whether disabling cgroups v1 could break anything on the DGX A100.
Any suggestions are welcome.
➜ slurm-23.02.3 systemctl status slurmd.service
× slurmd.service - Slurm node daemon
     Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2023-07-24 19:07:03 -04; 7s ago
    Process: 3680019 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 3680019 (code=exited, status=1/FAILURE)
        CPU: 40ms

jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug: Log file re-opened
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_init
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_load
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_export_xml
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug: CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=8:2(hw) CoresPerSocket=16:64(hw)
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 2:freezer:/
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: 0::/init.scope
jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Failed with result 'exit-code'.
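The "Mounted cgroups are:" list in that fatal error reflects the contents of /proc/self/cgroup: on pure cgroup v2 the file holds only the single "0::<path>" line, while any extra "N:controller:path" lines with N >= 1 are v1 hierarchies, i.e. the hybrid mode slurmd refuses to run in. A small sketch of that check (the helper name is mine, not Slurm's):

```shell
# Illustrative helper: classify a node's cgroup setup from the
# contents of /proc/self/cgroup. Any line starting with a digit
# 1-9 is a v1 hierarchy, so the node is in hybrid mode.
cgroup_mode() {
    if printf '%s\n' "$1" | grep -q '^[1-9]'; then
        echo hybrid
    else
        echo v2-only
    fi
}

# On a live node you would call: cgroup_mode "$(cat /proc/self/cgroup)"
# The sample below reproduces the state from the log above:
cgroup_mode '2:freezer:/
0::/init.scope'
# prints: hybrid
```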
On Wed, May 3, 2023 at 6:32 PM Angel de Vicente <angel.de.vice...@iac.es> wrote:
Hello,
Angel de Vicente <angel.de.vice...@iac.es> writes:
> ,----
> | slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are:
> | 5:freezer:/
> | 3:cpuacct:/
> `----
in the end I learned that, despite Ubuntu 22.04 reporting that it was
using only cgroup v2, it was also using v1 and creating those mount
points, and Slurm 23.02.01 then complained that it could not work with
cgroups in hybrid mode.
So the "solution" (as long as you don't need v1 for some reason) was to
add "cgroup_no_v1=all" to the kernel parameters and reboot: no more v1
mount points, and Slurm was happy with that.
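For the record, on a stock Ubuntu/GRUB setup that kernel parameter is typically added like this (assuming /etc/default/grub; verify against your boot loader and merge with any options already present):

```shell
# /etc/default/grub — append cgroup_no_v1=all to the kernel command line
GRUB_CMDLINE_LINUX="cgroup_no_v1=all"

# Then regenerate the grub config and reboot:
#   sudo update-grub
#   sudo reboot
# Afterwards, `mount | grep cgroup` should show only cgroup2 mounts.
```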
[in case somebody is interested in the future, I needed this so that I
could limit the resources given to users not using Slurm. We have some
shared workstations with many cores and users were oversubscribing the
CPUs, so I have installed Slurm to put some order in the executions
there. But these machines are not an actual cluster with a login node:
the login node is the same as the executing node! So with cgroups I
ensure that users connecting via ssh only get the equivalent of 3/4 of
a core (enough to edit files, etc.) until they submit their jobs via
Slurm, at which point they get the full allocation they requested].
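On cgroup v2 with systemd, that per-ssh-user cap can be expressed as a drop-in for the templated user slices; CPUQuota=75% here is my reading of "3/4 of a core", not a value from the mail above:

```ini
# /etc/systemd/system/user-.slice.d/50-cpu.conf
# Applies to every user-UID.slice, i.e. all login sessions,
# including ssh. Slurm job steps run under slurmd's own cgroup
# hierarchy, so they should not be capped by this.
[Slice]
CPUQuota=75%
```

Run `systemctl daemon-reload` afterwards; the limit takes effect for new sessions.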
Cheers,
-- Ángel de Vicente
Research Software Engineer (Supercomputing and BigData)
Tel.: +34 922-605-747
Web.: http://research.iac.es/proyecto/polmag/
GPG: 0x8BDC390B69033F52
--
Cristóbal A. Navarro
--
Ralf Utermann
Universität Augsburg
Rechenzentrum
D-86135 Augsburg
ralf.uterm...@uni-a.de
https://www.rz.uni-augsburg.de