On 26.07.23 at 11:38, Ralf Utermann wrote:
On 25.07.23 at 02:09, Cristóbal Navarro wrote:
Hello Angel and Community,
I am facing a similar problem with a DGX A100 running DGX OS 6 (based on Ubuntu
22.04 LTS) and Slurm 23.02.
When I start the `slurmd` service, its status shows as failed, with the
information below.

Hello Cristobal,

we see similar problems, not on a DGX but on standard server nodes running
Ubuntu 22.04 (kernel 5.15.0-76-generic) and Slurm 23.02.3.

The first start of the slurmd service always fails, with lots of errors
in the slurmd.log like:
   error: cpu cgroup controller is not available.
   error: There's an issue initializing memory or cpu controller
After 90 seconds this service start times out and slurmd is marked as failed.

BUT: One process is still running:
   /usr/local/slurm/23.02.3/sbin/slurmstepd infinity

This looks like the process started to handle cgroup v2 as described in
   https://slurm.schedmd.com/cgroup_v2.html
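
For anyone comparing configurations: a minimal cgroup.conf that pins Slurm
to the v2 plugin could look like the sketch below (the Constrain* lines are
only an example, adjust them to your site):

   # /etc/slurm/cgroup.conf -- minimal sketch
   CgroupPlugin=cgroup/v2
   ConstrainCores=yes
   ConstrainRAMSpace=yes
   ConstrainDevices=yes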

When we keep this slurmstepd infinity process running and just start
the slurmd service a second time, everything comes up fine.

So our current workaround is to configure the slurmd service
with Restart=on-failure in its [Service] section (see the drop-in sketch below).
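
In case it is useful to anyone, a minimal sketch of such a drop-in,
created for example with `systemctl edit slurmd.service` (the restart
delay is just an example value):

   # /etc/systemd/system/slurmd.service.d/override.conf
   [Service]
   Restart=on-failure
   RestartSec=5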


Are there real solutions to this initial timeout failure?

best regards, Ralf



As of today, what is the best solution to this problem? I am not sure whether
the DGX A100 might break if cgroups v1 is disabled.
Any suggestions are welcome.

➜  slurm-23.02.3 systemctl status slurmd.service
× slurmd.service - Slurm node daemon
      Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
      Active: failed (Result: exit-code) since Mon 2023-07-24 19:07:03 -04; 7s ago
     Process: 3680019 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
    Main PID: 3680019 (code=exited, status=1/FAILURE)
         CPU: 40ms

jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug:  Log file re-opened
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_init
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_load
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_export_xml
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug:  CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=8:2(hw) CoresPerSocket=16:64(hw)
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 2:freezer:/
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: 0::/init.scope
jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Failed with result 'exit-code'.
➜  slurm-23.02.3
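
The fatal message above means slurmd found a cgroup v1 controller (here the
freezer controller) mounted alongside the unified v2 hierarchy. A quick way
to check what is mounted on a node, just as a sketch:

   # list any remaining cgroup v1 controller mounts
   grep cgroup /proc/self/mounts
   # a pure cgroup v2 node reports cgroup2fs here
   stat -fc %T /sys/fs/cgroup/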



On Wed, May 3, 2023 at 6:32 PM Angel de Vicente <angel.de.vice...@iac.es> wrote:

    Hello,

    Angel de Vicente <angel.de.vice...@iac.es> writes:

     > ,----
     > | slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are:
     > | 5:freezer:/
     > | 3:cpuacct:/
     > `----

    in the end I learned that, although Ubuntu 22.04 reported that it was
    using only cgroup v2, it was also using v1 and creating those mount
    points, and Slurm 23.02.01 then complained that it could not work with
    cgroups in hybrid mode.

    So, the "solution" (as long as you don't need v1 for some reason) was to
    add "cgroup_no_v1=all" to the kernel parameters and reboot: no more v1
    mount points, and Slurm was happy with that.
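
    On Ubuntu the usual place for that is the GRUB command line; a sketch of
    the change (exact file and variable may differ on your system):

       # /etc/default/grub  (append to the existing parameters)
       GRUB_CMDLINE_LINUX_DEFAULT="quiet splash cgroup_no_v1=all"

       # then regenerate the GRUB config and reboot
       sudo update-grub
       sudo reboot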

    [In case somebody is interested in the future: I needed this so that I
    could limit the resources given to users not going through Slurm. We
    have some shared workstations with many cores, and users were
    oversubscribing the CPUs, so I installed Slurm to bring some order to
    the executions there. But these machines are not an actual cluster with
    a login node: the login node is the same as the execution node! So with
    cgroups I make sure that users connecting via ssh only get resources
    equivalent to 3/4 of a core (enough to edit files, etc.) until they
    submit their jobs via Slurm, at which point they get the full allocation
    they requested.]
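
    A per-user limit like that can be expressed, for example, with a systemd
    drop-in for the user-.slice template; a minimal sketch (the 75% quota
    matches the 3/4 of a core mentioned above):

       # /etc/systemd/system/user-.slice.d/75-cpu.conf
       [Slice]
       CPUQuota=75%

       # reload systemd so that new user sessions pick up the limit
       sudo systemctl daemon-reload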

    Cheers,
    --     Ángel de Vicente
      Research Software Engineer (Supercomputing and BigData)
      Tel.: +34 922-605-747
      Web.: http://research.iac.es/proyecto/polmag/

      GPG: 0x8BDC390B69033F52



--
Cristóbal A. Navarro


--
        Ralf Utermann

        Universität Augsburg
        Rechenzentrum
        D-86135 Augsburg

        ralf.uterm...@uni-a.de
        https://www.rz.uni-augsburg.de

