In src/plugins/cgroup/v2/ebpf.c, disable logging, i.e. change
attr.log_level = 1;
attr.log_buf = (size_t) log;
attr.log_size = sizeof(log);
to
attr.log_level = 0;
attr.log_buf = 0;
attr.log_size = 0;
I think you'll find that this fixes it.
I have no idea why this is a problem in this specific kernel release.
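
For reference, here is a minimal sketch of how the change could be made less invasive: keep the log on the first attempt and retry without it only when the load fails with ENOSPC. The bpf() syscall returns ENOSPC when the log does not fit in the supplied buffer; whether that is what happens on this kernel is an assumption, not something confirmed in this report. The helper names set_verifier_log and load_prog_with_fallback are hypothetical and do not exist in Slurm.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Hypothetical helper: point the log at buf, or disable logging
 * entirely when buf is NULL or len is 0. */
void set_verifier_log(union bpf_attr *attr, char *buf, size_t len)
{
	assert(attr != NULL);
	if (buf != NULL && len > 0) {
		attr->log_level = 1;
		attr->log_buf = (uint64_t)(uintptr_t)buf;
		attr->log_size = (uint32_t)len;
	} else {
		attr->log_level = 0;
		attr->log_buf = 0;	/* integer field, so 0 rather than NULL */
		attr->log_size = 0;
	}
}

/* Hypothetical helper: attr must already describe the program (type,
 * instructions, license). Try the load with logging first; if the kernel
 * reports ENOSPC, retry once with logging disabled.
 * Returns the program fd, or -1 with errno set. */
int load_prog_with_fallback(union bpf_attr *attr, char *buf, size_t len)
{
	int fd;

	set_verifier_log(attr, buf, len);
	fd = (int)syscall(SYS_bpf, BPF_PROG_LOAD, attr, sizeof(*attr));
	if (fd < 0 && errno == ENOSPC) {
		set_verifier_log(attr, NULL, 0);
		fd = (int)syscall(SYS_bpf, BPF_PROG_LOAD, attr, sizeof(*attr));
	}
	return fd;
}
```

With this shape the log stays available for debugging on kernels where it fits, and the load still succeeds where it does not.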
--
https://bugs.launchpad.net/bugs/2050098
Title:
cgroup2 broken since 5.15.0-90-generic?
Status in linux-signed package in Ubuntu:
Confirmed
Bug description:
We're using Slurm workload manager in a cluster with Ubuntu 22.04 and
the linux-generic kernel (amd64). We use cgroups (cgroup2) for
resource allocation with Slurm. With kernel version
linux-image-5.15.0-91-generic 5.15.0-91.101 (amd64) I'm seeing a new
issue. This must have been introduced recently, since I can confirm
that with kernel 5.15.0-88-generic the issue does not exist.
When I request a single GPU on a node with kernel 5.15.0-88-generic
all is well:
$ srun -G 1 -w gpu59 nvidia-smi -L
GPU 0: NVIDIA [...]
Instead with kernel 5.15.0-91-generic:
$ srun -G 1 -w gpu59 nvidia-smi -L
slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).
GPU 0: NVIDIA [...]
GPU 1: NVIDIA [...]
GPU 2: NVIDIA [...]
GPU 3: NVIDIA [...]
GPU 4: NVIDIA [...]
GPU 5: NVIDIA [...]
GPU 6: NVIDIA [...]
GPU 7: NVIDIA [...]
So I get this error regarding MEMLOCK limit and see all GPUs in the
system instead of only the one requested. Hence I assume the problem
is related to cgroups.
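
Since the error message points at MEMLOCK, the limit can be checked with a small C sketch like the one below. It reads RLIMIT_MEMLOCK through getrlimit(); the function names read_memlock_limit and print_limit are made up for illustration. The limit is per process, so the numbers only say something about slurmstepd if the check runs in the same environment as slurmstepd (an assumption about how it is started).

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Read the RLIMIT_MEMLOCK soft/hard limits of the calling process.
 * Returns 0 on success, -1 with errno set on failure. */
int read_memlock_limit(struct rlimit *out)
{
	assert(out != NULL);
	return getrlimit(RLIMIT_MEMLOCK, out);
}

/* Print one limit value, spelling out RLIM_INFINITY as "unlimited". */
void print_limit(const char *label, rlim_t value)
{
	if (value == RLIM_INFINITY)
		printf("%s: unlimited\n", label);
	else
		printf("%s: %llu bytes\n", label, (unsigned long long)value);
}
```

A caller would run read_memlock_limit(&lim) and then print_limit("soft", lim.rlim_cur) and print_limit("hard", lim.rlim_max). If both already show a generous or unlimited value, the MEMLOCK hint in the error message is probably not the real cause.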
$ cat /proc/version_signature
Ubuntu 5.15.0-91.101-generic 5.15.131