Many thanks! One question: do we have to apply this patch (and recompile Slurm, I guess) only on the compute nodes with problems? Also, I noticed the patch now appears as "obsolete" in the bug report; is that OK?
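In case it helps others following along, here is roughly what I understand the procedure to be. This is just a sketch, untested on our side; the source path and patch filename below are placeholders for whatever your local build uses:

# Sketch only: apply the patch from bug 17210 to the Slurm source tree
# matching the installed version (path and patch filename are hypothetical).
cd /usr/local/src/slurm-22.05.9
patch -p1 < ~/bug17210.patch    # patch level (-p0/-p1) depends on how it was generated

./configure --prefix=/usr/local --sysconfdir=/etc/slurm   # reuse your original options
make -j"$(nproc)"
sudo make install

# The error comes from slurmstepd on the compute node, so restarting
# slurmd there after installing the patched binaries should be enough.
sudo systemctl restart slurmd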
On Wed, Jan 24, 2024 at 9:52 AM Stefan Fleischmann <s...@kth.se> wrote:

> Turns out I was wrong, this is not a problem in the kernel at all. It's
> a known bug that is triggered by long bpf logs, see here:
> https://bugs.schedmd.com/show_bug.cgi?id=17210
>
> There is a patch included there.
>
> Cheers,
> Stefan
>
> On Tue, 23 Jan 2024 15:28:59 +0100 Stefan Fleischmann <s...@kth.se>
> wrote:
> > I don't think there is much for SchedMD to do. As I said, since it is
> > working fine with newer kernels there doesn't seem to be any breaking
> > change in cgroup2 in general, but only a regression introduced in one
> > of the latest updates in 5.15.
> >
> > If Slurm was doing something wrong with cgroup2, and it accidentally
> > worked until this recent change, then other kernel versions should
> > show the same behavior. But as far as I can tell it still works just
> > fine with newer kernels.
> >
> > Cheers,
> > Stefan
> >
> > On Tue, 23 Jan 2024 15:20:56 +0100
> > Tim Schneider <tim.schneid...@tu-darmstadt.de> wrote:
> >
> > > Hi,
> > >
> > > I have filed a bug report with SchedMD
> > > (https://bugs.schedmd.com/show_bug.cgi?id=18623), but the support
> > > told me they cannot invest time in this issue since I don't have a
> > > support contract. Maybe they will look into it once it affects more
> > > people or someone important enough.
> > >
> > > So far, I have resorted to using 5.15.0-89-generic, but I am also a
> > > bit concerned about the security aspect of this choice.
> > >
> > > Best,
> > >
> > > Tim
> > >
> > > On 23.01.24 14:59, Stefan Fleischmann wrote:
> > > > Hi!
> > > >
> > > > I'm seeing the same in our environment. My conclusion is that it
> > > > is a regression in the Ubuntu 5.15 kernel, introduced with
> > > > 5.15.0-90-generic. The last working kernel version is
> > > > 5.15.0-89-generic. I have filed a bug report here:
> > > > https://bugs.launchpad.net/bugs/2050098
> > > >
> > > > Please add yourself to the affected users in the bug report so it
> > > > hopefully gets more attention.
> > > >
> > > > I've tested with newer kernels (6.5, 6.6 and 6.7) and the problem
> > > > does not exist there. 6.5 is the latest HWE kernel for 22.04 and
> > > > would be an option for now. Reverting back to 5.15.0-89 would work
> > > > as well, but I haven't looked into the security aspects of that.
> > > >
> > > > Cheers,
> > > > Stefan
> > > >
> > > > On Mon, 22 Jan 2024 13:31:15 -0300
> > > > cristobal.navarro.g at gmail.com wrote:
> > > >
> > > >> Hi Tim and community,
> > > >> We have been having the same issue (cgroups not working, it
> > > >> seems: jobs see all GPUs) on a GPU compute node (DGX A100)
> > > >> since a couple of days ago, after a full update (apt upgrade).
> > > >> Now whenever we launch a job for that partition, we get the
> > > >> error message mentioned by Tim. As a note, we have another
> > > >> custom GPU compute node with L40s, on a different partition,
> > > >> and that one works fine. Before this error, we always had small
> > > >> differences in kernel version between nodes, so I am not sure
> > > >> whether this could be the problem. Nevertheless, here is the
> > > >> info of our nodes as well.
> > > >>
> > > >> *[Problem node]* The DGX A100 node has this kernel:
> > > >> cnavarro@nodeGPU01:~$ uname -a
> > > >> Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15
> > > >> 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
> > > >>
> > > >> *[Functioning node]* The custom GPU node (L40s) has this kernel:
> > > >> cnavarro@nodeGPU02:~$ uname -a
> > > >> Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14
> > > >> 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
> > > >>
> > > >> *And the login node* (slurmctld):
> > > >> $ uname -a
> > > >> Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14
> > > >> 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
> > > >>
> > > >> Any ideas what we should check?
> > > >>
> > > >> On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider <tim.schneider1 at
> > > >> tu-darmstadt.de> wrote:
> > > >>
> > > >>> Hi,
> > > >>>
> > > >>> I am using SLURM 22.05.9 on a small compute cluster. Since I
> > > >>> reinstalled two of our nodes, I get the following error when
> > > >>> launching a job:
> > > >>>
> > > >>> slurmstepd: error: load_ebpf_prog: BPF load error (No space left
> > > >>> on device). Please check your system limits (MEMLOCK).
> > > >>>
> > > >>> Also, the cgroups do not seem to work properly anymore, as I am
> > > >>> able to see all GPUs even if I do not request them, which is not
> > > >>> the case on the other nodes.
> > > >>>
> > > >>> One difference I found between the updated nodes and the
> > > >>> original nodes (both are Ubuntu 22.04) is the kernel version,
> > > >>> which is "5.15.0-89-generic #99-Ubuntu SMP" on the functioning
> > > >>> nodes and "5.15.0-91-generic #101-Ubuntu SMP" on the updated
> > > >>> nodes. I could not figure out how to install the exact first
> > > >>> kernel version on the updated nodes, but I noticed that when I
> > > >>> reinstall 5.15.0 with this tool:
> > > >>> https://github.com/pimlie/ubuntu-mainline-kernel.sh, the error
> > > >>> message disappears. However, once I do that, the network driver
> > > >>> no longer functions properly, so this does not seem to be a
> > > >>> good solution.
> > > >>>
> > > >>> Has anyone seen this issue before, or is there maybe something
> > > >>> else I should take a look at? I am also happy to just find a
> > > >>> workaround so that I can take these nodes back online.
> > > >>>
> > > >>> I appreciate any help!
> > > >>>
> > > >>> Thanks a lot in advance and best wishes,
> > > >>>
> > > >>> Tim

--
Cristóbal A. Navarro
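PS: For anyone who prefers to stay on the last known-good kernel until the fix lands (as Tim and Stefan discuss above), the 5.15.0-89 packages are still in the Ubuntu 22.04 archive. Something along these lines should work; the package names are from memory, so check with apt-cache if one is off:

# Install the last working kernel explicitly and keep apt from replacing it.
sudo apt-get update
sudo apt-get install linux-image-5.15.0-89-generic \
    linux-headers-5.15.0-89-generic \
    linux-modules-extra-5.15.0-89-generic

# Hold the kernel meta-packages so a later apt upgrade does not pull in
# another broken 5.15 revision.
sudo apt-mark hold linux-generic linux-image-generic linux-headers-generic

# If 5.15.0-91 is still installed it will be the GRUB default, so either
# remove it or select 5.15.0-89 in the GRUB menu, then verify after reboot:
uname -r    # should print 5.15.0-89-generic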
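PPS: A quick way to check whether cgroup GPU confinement is working on a node after patching or reverting the kernel; the partition name here is an example, adjust to your setup:

# Ask for exactly one GPU; with working cgroup confinement, the job
# should see a single device.
srun -p gpu --gres=gpu:1 nvidia-smi -L

# On an affected node this instead lists every GPU in the machine, and
# slurmstepd logs the "BPF load error (No space left on device)" message.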