Hi,

A few minutes ago I recompiled the cgroup_v2 plugin from Slurm with the fix included and replaced the old cgroup_v2.{a,la,so} files with the new ones in /usr/lib/slurm, and now jobs run properly on that node. Many thanks for all the help. Indeed, in a few months we will eventually update to the most recent 23.xx or 24.xx release.
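In case it is useful to someone else, roughly the steps I followed. This is only a sketch: the source path and patch file name are examples, and the plugin subdirectory and .libs location assume the usual Slurm/libtool source layout, so adjust to your own tree and install prefix.

# hypothetical paths -- adjust to your own source tree and plugin directory
SLURM_SRC=$HOME/build/slurm            # source tree configured with the same options as the installed build
PLUGDIR=/usr/lib/slurm                 # where the installed Slurm plugins live on our nodes

cd "$SLURM_SRC"
patch -p1 < bug17210.patch             # patch attached to https://bugs.schedmd.com/show_bug.cgi?id=17210
make -C src/plugins/cgroup/v2          # rebuild only the cgroup/v2 plugin
# libtool leaves the freshly built shared object under .libs/; as Stefan mentioned below,
# replacing just cgroup_v2.so was already enough in his case
cp src/plugins/cgroup/v2/.libs/cgroup_v2.so "$PLUGDIR"/
systemctl restart slurmd               # restart slurmd on the node so the new plugin gets loaded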
On Wed, Jan 24, 2024 at 1:20 PM Tim Schneider <tim.schneid...@tu-darmstadt.de> wrote:

> Hi,
>
> I just tested with 23.02.7-1 and the issue is gone. So it seems like the patch got released.
>
> Best,
>
> Tim
>
> On 1/24/24 16:55, Stefan Fleischmann wrote:
> > On Wed, 24 Jan 2024 12:37:04 -0300 Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:
> >> Many thanks
> >> One question? Do we have to apply this patch (and recompile slurm i guess) only on the compute-node with problems?
> >> Also, I noticed the patch now appears as "obsolete", is that ok?
> > We have Slurm installed on a NFS share, so what I did was to recompile it and then I only replaced the library lib/slurm/cgroup_v2.so. Good enough for now, I've been planning to update to 23.11 anyway soon.
> >
> > I suppose it's marked as obsolete because the patch went into a release. According to the info in the bug report it should have been included in 23.02.4.
> >
> > Cheers,
> > Stefan
> >
> >> On Wed, Jan 24, 2024 at 9:52 AM Stefan Fleischmann <s...@kth.se> wrote:
> >>
> >>> Turns out I was wrong, this is not a problem in the kernel at all. It's a known bug that is triggered by long bpf logs, see here https://bugs.schedmd.com/show_bug.cgi?id=17210
> >>>
> >>> There is a patch included there.
> >>>
> >>> Cheers,
> >>> Stefan
> >>>
> >>> On Tue, 23 Jan 2024 15:28:59 +0100 Stefan Fleischmann <s...@kth.se> wrote:
> >>>> I don't think there is much for SchedMD to do. As I said since it is working fine with newer kernels there doesn't seem to be any breaking change in cgroup2 in general, but only a regression introduced in one of the latest updates in 5.15.
> >>>>
> >>>> If Slurm was doing something wrong with cgroup2, and it accidentally worked until this recent change, then other kernel versions should show the same behavior. But as far as I can tell it still works just fine with newer kernels.
> >>>>
> >>>> Cheers,
> >>>> Stefan
> >>>>
> >>>> On Tue, 23 Jan 2024 15:20:56 +0100 Tim Schneider <tim.schneid...@tu-darmstadt.de> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I have filed a bug report with SchedMD (https://bugs.schedmd.com/show_bug.cgi?id=18623), but the support told me they cannot invest time in this issue since I don't have a support contract. Maybe they will look into it once it affects more people or someone important enough.
> >>>>>
> >>>>> So far, I have resorted to using 5.15.0-89-generic, but I am also a bit concerned about the security aspect of this choice.
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> Tim
> >>>>>
> >>>>> On 23.01.24 14:59, Stefan Fleischmann wrote:
> >>>>>> Hi!
> >>>>>>
> >>>>>> I'm seeing the same in our environment. My conclusion is that it is a regression in the Ubuntu 5.15 kernel, introduced with 5.15.0-90-generic. Last working kernel version is 5.15.0-89-generic. I have filed a bug report here: https://bugs.launchpad.net/bugs/2050098
> >>>>>>
> >>>>>> Please add yourself to the affected users in the bug report so it hopefully gets more attention.
> >>>>>>
> >>>>>> I've tested with newer kernels (6.5, 6.6 and 6.7) and the problem does not exist there. 6.5 is the latest hwe kernel for 22.04 and would be an option for now. Reverting back to 5.15.0-89 would work as well, but I haven't looked into the security aspects of that.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Stefan
> >>>>>>
> >>>>>> On Mon, 22 Jan 2024 13:31:15 -0300 cristobal.navarro.g at gmail.com wrote:
> >>>>>>
> >>>>>>> Hi Tim and community,
> >>>>>>> We are currently having the same issue (cgroups not working it seems, showing all GPUs on jobs) on a GPU-compute node (DGX A100) a couple of days ago after a full update (apt upgrade). Now whenever we launch a job for that partition, we get the error message mentioned by Tim. As a note, we have another custom GPU-compute node with L40s, on a different partition, and that one works fine. Before this error, we always had small differences in kernel version between nodes, so I am not sure if this can be the problem. Nevertheless, here is the info of our nodes as well.
> >>>>>>>
> >>>>>>> *[Problem node]* The DGX A100 node has this kernel
> >>>>>>> cnavarro at nodeGPU01:~$ uname -a
> >>>>>>> Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
> >>>>>>>
> >>>>>>> *[Functioning node]* The Custom GPU node (L40s) has this kernel
> >>>>>>> cnavarro at nodeGPU02:~$ uname -a
> >>>>>>> Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
> >>>>>>>
> >>>>>>> *And the login node* (slurmctld)
> >>>>>>> $ uname -a
> >>>>>>> Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
> >>>>>>>
> >>>>>>> Any ideas what we should check?
> >>>>>>>
> >>>>>>> On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider <tim.schneider1 at tu-darmstadt.de> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I am using SLURM 22.05.9 on a small compute cluster. Since I reinstalled two of our nodes, I get the following error when launching a job:
> >>>>>>>>
> >>>>>>>> slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).
> >>>>>>>>
> >>>>>>>> Also the cgroups do not seem to work properly anymore, as I am able to see all GPUs even if I do not request them, which is not the case on the other nodes.
> >>>>>>>>
> >>>>>>>> One difference I found between the updated nodes and the original nodes (both are Ubuntu 22.04) is the kernel version, which is "5.15.0-89-generic #99-Ubuntu SMP" on the functioning nodes and "5.15.0-91-generic #101-Ubuntu SMP" on the updated nodes. I could not figure out how to install the exact first kernel version on the updated nodes, but I noticed that when I reinstall 5.15.0 with this tool: https://github.com/pimlie/ubuntu-mainline-kernel.sh, the error message disappears. However, once I do that, the network driver does not function properly anymore, so this does not seem to be a good solution.
> >>>>>>>>
> >>>>>>>> Has anyone seen this issue before or is there maybe something else I should take a look at? I am also happy to just find a workaround such that I can take these nodes back online.
> >>>>>>>>
> >>>>>>>> I appreciate any help!
> >>>>>>>>
> >>>>>>>> Thanks a lot in advance and best wishes,
> >>>>>>>>
> >>>>>>>> Tim
> >>>>>>>>

--
Cristóbal A. Navarro
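PS for anyone who finds this thread in the archives: these are the quick checks that helped on our side. A rough sketch only; the "gpu" gres name and the single-GPU request are just examples from our configuration.

# on the affected compute node: confirm which kernel is actually running
uname -r                                  # 5.15.0-90/-91-generic is where the regression appeared for us

# MEMLOCK limit of the running slurmd (the error message points here, even though
# the real cause in this thread was the bpf log bug fixed via bug 17210)
grep "locked memory" /proc/$(pgrep -o slurmd)/limits

# from the login node: once device confinement works again, a job requesting
# a single GPU should only see that one
srun --gres=gpu:1 nvidia-smi -L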