[systemd-devel] Making /run respect Container Memory Limits

Matthew Ife Sat, 21 Sep 2024 08:27:05 -0700

I have a question about a bug and the (primarily philosophical) approach to 
take in fixing it.


Note that I'm not asking for the bug to be fixed for me. I'm happy to supply 
patches; I am just keen to get some direction on the approach to take.

The bug stems from mount decisions made at the very start of systemd's 
startup.

When systemd starts it sets up its initial mount points, one of which is /run. 
This tmpfs is hardcoded to a size of 20%, which tmpfs interprets as 20% of the 
host's physical RAM.

The problem is that on a big-memory host (let's say, for argument's sake, 256 
GiB) running many very small containers with cgroup memory limits set (let's 
go with 2 GiB), the default configuration of systemd leads to an inevitable 
lockup. This feels like a design oversight to me.

---

What's basically happening is the following:
- We define a 2 GiB container and start the init process.
- /run in the container is sized at 20% of host memory (about 51 GiB).
- systemd-journald starts and logs into `/run`. It is itself configured to use 
only 15% of that (about 7.7 GiB).
- Time passes and logs accumulate in the container's /run until its usage is 
close to or exceeding 2 GiB.
- The container becomes memory constrained and the OOM killer kills off 
processes.
- The journal logs these OOM kills too, using up further space.
- This cycle continues until there is no memory left and no valid processes 
left to kill, and the system becomes completely stalled and locked up.
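
Working the numbers through (20% of 256 GiB comes to 51.2 GiB exactly), a 
rough sketch of the arithmetic follows. The 256 GiB, 2 GiB, 20% and 15% 
figures are the ones from this mail; the variable names are only for 
illustration.

```python
GIB = 1024 ** 3

host_ram = 256 * GIB         # example host from this mail
container_limit = 2 * GIB    # example cgroup memory limit

# tmpfs resolves "size=20%" against host RAM, not the cgroup limit.
run_size = host_ram * 20 // 100
# journald share of /run as quoted in this mail (15%).
journal_cap = run_size * 15 // 100

print(f"/run size:        {run_size / GIB:.1f} GiB")
print(f"journal cap:      {journal_cap / GIB:.2f} GiB")
print(f"container limit:  {container_limit / GIB:.1f} GiB")
print(f"journal cap is {journal_cap / container_limit:.1f}x the limit")
```

The point is only that the journal's own cap is several times the container's 
memory limit, so the cap can never kick in before the cgroup runs out.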

This lockup is not at all obvious to normal users: it is hard to understand how 
a container can become completely locked up when it appears not to be running 
anything.
Indeed, the ultimate fate of a container like this is an inevitable lockup, 
since logging will keep consuming tmpfs space beyond the constraints of the 
container.

---

After perusing some of the code, there seem to be a few approaches to fixing 
this.

1. Add some code to `mount_setup` that goes back over `/run` and recalculates a 
size that respects the container's constraints in some sensible way (16 MiB or 
20% of the container's memory limit, whichever is higher) whenever a container 
type is detected, then issues a remount.
2. Much like 1, but instead add a function pointer field to the `struct 
MountPoint` entry associated with /run to do effectively the same thing, in a 
manner that's generalized enough to be used for future cases. (/dev might also 
benefit, but it's not as bad.)
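
To make the sizing rule in option 1 concrete, here is a minimal sketch in 
Python rather than systemd's C. The function names are hypothetical and 
nothing here is existing systemd code; it only shows the "16 MiB or 20% of the 
limit, whichever is higher" calculation and the tmpfs `size=` option it would 
feed into a remount.

```python
MIB = 1024 ** 2
MIN_RUN_SIZE = 16 * MIB  # floor proposed in option 1


def run_size_for(limit_bytes, percent=20):
    """Return a /run size in bytes for a container memory limit.

    Hypothetical helper: takes the larger of the 16 MiB floor and
    `percent` of the limit.
    """
    return max(MIN_RUN_SIZE, limit_bytes * percent // 100)


def remount_options(limit_bytes):
    """Build a tmpfs option string carrying the recalculated size."""
    return f"remount,size={run_size_for(limit_bytes)}"


if __name__ == "__main__":
    print(remount_options(2 * 1024 ** 3))  # 2 GiB container
    print(remount_options(32 * MIB))       # tiny container hits the floor
```

For the 2 GiB example this gives a /run of roughly 410 MiB instead of about 
51 GiB.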

However, the problem with 1 or 2 is that Ubuntu (and probably other Debian 
derivatives) really doesn't like you remounting tmpfs filesystems in a user 
namespace later on (in LXC, as far as I'm aware). This is due to AppArmor 
constraints.
To get around that, one could write some code in `mount_setup` that unmounts 
`/run` and mounts it again with the correct size, which avoids triggering a 
security alert, provided nothing has been written to it (I haven't checked 
that bit). This also feels pretty ugly.

So the philosophical question is: do you consider fixes where you anticipate 
the solution being ineffective unless a distro provider makes changes? Or 
would it be more suitable to fix the early boot process so that distro 
providers do not need to change their software?

The trickier, 'proper' fix to all this is not to go back over /run but to 
mount it with the correct size from the start. The problem here is that 
cgroupfs is mounted afterwards (not during early boot), so it is effectively 
impossible to calculate tmpfs sizes at that point.
This would require refactoring the early boot code to mount cgroupfs even 
earlier. That makes me pretty nervous, as it feels as if some conscious effort 
went into what gets mounted and when. I'm also quite aware of the cgroupv1 and 
cgroupv2 complexity that adds to all this.
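
For reference, this is roughly the lookup that would need cgroupfs to be 
available. It is a sketch only, in Python rather than systemd's C, and it 
assumes the conventional `/sys/fs/cgroup` mount point with the container seeing 
its own cgroup at the root of that tree. `memory.max` (v2) and 
`memory/memory.limit_in_bytes` (v1) are the kernel's interface files; the 
function names are hypothetical.

```python
from pathlib import Path

# cgroup v1 reports "no limit" as a very large number; on 64-bit kernels
# this is usually 0x7FFFFFFFFFFFF000. Treated here as an assumption.
V1_UNLIMITED = 0x7FFFFFFFFFFFF000


def parse_limit(text):
    """Parse a memory limit file; return bytes, or None if unlimited."""
    text = text.strip()
    if text == "max":  # cgroup v2 spelling of "no limit"
        return None
    value = int(text)
    return None if value >= V1_UNLIMITED else value


def container_memory_limit(root="/sys/fs/cgroup"):
    """Try the v2 file first, then the v1 file; None if neither is usable."""
    for rel in ("memory.max", "memory/memory.limit_in_bytes"):
        try:
            return parse_limit((Path(root) / rel).read_text())
        except (OSError, ValueError):
            continue
    return None


if __name__ == "__main__":
    print(container_memory_limit())
```

If cgroupfs is not mounted yet, both reads fail and the function returns None, 
which is exactly the situation at the point where /run is currently mounted.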

Finally, another fix (and the one we're using in production) is simply to 
remount the `/run` path via `/etc/fstab` with manually calculated sizes 
pertaining to the container. I'm assuming this could also be done with a 
generator of some kind. Again, this falls foul of the distro-level security 
issue of AppArmor not allowing remounts. Our production solution is to disable 
the AppArmor profile for LXC (sigh).

In any case, I'm really just looking for some direction before I consider some 
patches regarding this problem.

Whilst it can be worked around late in boot, it seems to be a design oversight 
that it's possible to produce a systemd system which is guaranteed to fail at 
some point in the future due to bad mount options. It feels to me that a more 
sensible approach to mount options should be implemented in code at early 
boot.
