On Tue, 25 Jun 2024 at 10:16:20 +0200, Helmut Grohne wrote:
> In this work, limitations with --chroot-mode=unshare became apparent and
> that lead to Johannes, Jochen and me sitting down in Berlin pondering
> ideas on how to improve the situation. That is a longer story, but
> eventually Timo Röhling asked the innocuous question of why we cannot
> just use schroot and make it work with namespaces.

I have to ask: Could we use a container framework that is also used
outside the Debian bubble, rather than writing our own from first
principles every time, and ending up with a single-maintainer project
being load-bearing for Debian *again*?

I had hoped that after sbuild's history with schroot becoming
unmaintained, and then being revived by a maintainer-of-last-resort who
is one of the same few people who are critical-path for various other
important things, we would recognise that as an anti-pattern that we
should avoid if we can.

At the moment, rootless Podman would seem like the obvious choice. As far
as I'm aware, it has the same user namespaces requirements as the unshare
backends in mmdebstrap, autopkgtest and schroot (user namespaces enabled,
setuid newuidmap, 65536 uids in /etc/subuid, 65536 gids in /etc/subgid).

Podman uses the same OCI images as Docker, so it can either pull from a
trusted OCI registry, or use images that were built by importing a
tarball generated by e.g. mmdebstrap or sbuild-createchroot. I assume
that for Debian we would want to do the latter, at least initially, to
avoid being forced to either trust an external registry like
hub.docker.com or operate our own.

Here's the Dockerfile/Containerfile to turn a sysroot tarball into an OCI
image (obviously it can be extended with LABELs and other customizations,
but this is fairly close to minimal):

    FROM scratch
    ADD sysroot.tar.gz /
    CMD ["/bin/bash"]

The reason I suggest Podman rather than Docker is that Podman is normally
"daemonless" (the container is an ordinary process tree, like schroot,
rather than being launched by command-execution RPC to dockerd) and is
normally used "rootless" (whereas Docker *can* be configured to be
"rootless" but in practice it seems that's very uncommon).

podman is also supported as a backend by autopkgtest-virt-podman, Toolbx
(podman-toolbox in Debian) and distrobox.

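To make the tarball-to-image step described above a little more concrete,
here is a minimal sketch in Python. It only assumes that the `podman`
command is on PATH and that a sysroot tarball already exists; the function
names, the context directory layout and the image tag are hypothetical,
chosen purely for illustration, and this is not how sbuild or autopkgtest
currently do it.

```python
import shutil
import subprocess
from pathlib import Path

# The three-line Containerfile quoted above, verbatim.
CONTAINERFILE = 'FROM scratch\nADD sysroot.tar.gz /\nCMD ["/bin/bash"]\n'


def prepare_context(tarball: Path, context: Path) -> Path:
    """Populate a build context directory and return the Containerfile path."""
    context.mkdir(parents=True, exist_ok=True)
    # ADD expects the tarball under the name used in the Containerfile.
    shutil.copyfile(tarball, context / "sysroot.tar.gz")
    containerfile = context / "Containerfile"
    containerfile.write_text(CONTAINERFILE)
    return containerfile


def build_command(context: Path, tag: str) -> list[str]:
    """Return the argv for `podman build`; nothing is executed here."""
    return ["podman", "build", "--tag", tag,
            "--file", str(context / "Containerfile"), str(context)]


def build_image(tarball: Path, context: Path, tag: str) -> None:
    """Build the image as the calling (unprivileged) user."""
    prepare_context(tarball, context)
    subprocess.run(build_command(context, tag), check=True)
```

Something like build_image(Path("sid.tar.gz"), Path("ctx"),
"localhost/debian-sid") would then be run as an ordinary user; the tarball
name and tag here are placeholders.
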
autopkgtest's autopkgtest-build-podman does not yet support starting from
a tarball as described above, but it easily could (contributions
welcome).

Or, if Podman is too "not invented here" for Debian's use, using rootless
lxd/Incus is another option - although that introduces a dependency on
projects and formats that are rarely used outside the Debian/Ubuntu
bubble, which risks them becoming another schroot (and also requires us
to decide whether we follow Canonical's lxd or the community fork Incus
post-fork, which could get somewhat political).

> There are two approaches to
> managing an ephemeral build container using namespaces. In one approach,
> we create a directory hierarchy of a container root filesystem and for
> each command and hook that we invoke there, we create new namespaces on
> demand. In particular, there are no background processes when nothing is
> running in that container and all that remains is its directory
> hierarchy. Such a container session can easily survive a reboot (unless
> stored on tmpfs). Both sbuild --chroot-mode=unshare and unschroot.py
> follow this approach. For comparison, schroot sets up mounts (e.g /proc)
> when it begins a session and cleans them up when it ends. No such
> persistent mounts exist in either sbuild --chroot-mode=unshare or
> unschroot.py.

Persisting a container root filesystem between multiple operations comes
with some serious correctness issues if there are "hooks" that can modify
it destructively on each operation: see <https://bugs.debian.org/499014>
and <https://bugs.debian.org/994836>.

As a result of that, I think the only model that should be used in new
systems is to have some concept of a session (like schroot type=file, but
unlike schroot type=directory) so that those "hooks" only run once, on
session creation, preventing them from arbitrarily reverting/overwriting
changes that are subsequently made by packages installed into the
chroot/container (for example dbus' creation of the messagebus uid/gid in
#499014, and exim4's creation of Debian-exim in #994836).

I don't know whether creating new namespaces multiple times (but without
running external integration hooks the second and subsequent times) will
also lead to practical problems, but I note that outside the Debian
bubble, everything that enters a new container environment seems to
operate by creating a process that encapsulates the container, and then
either letting it run to completion interactively or non-interactively
(`docker run`, etc.), or letting it run in the background (perhaps with
an init system or `sleep infinity` as its "payload" process) and then
repeatedly injecting code into that pre-existing namespace (either
`docker exec`, etc., or something like ssh).

autopkgtest's Docker, Podman, lxc, lxd backends all operate by creating a
namespaced init or sleep process with `docker run` or equivalent, and
then injecting subsequent commands into the namespace that was created
for that long-running process with `docker exec` or equivalent.

I think unshare is the outlier here, and I think it would be good to
consider whether it really needs to be. The more like other container
managers a new container manager is, the less likely it is to break
reasonable expectations in future, like schroot regularly does.

> While podman
> and docker allow running unprivileged application containers, they still
> require privileged containers when you want to run systemd-as-pid-1.

What do you mean by "privileged containers" exactly?
Do you mean a system service that runs with CAP_SYS_ADMIN and other scary
privileges in the init namespace, like the typical use of dockerd, or are
you also counting uses of the setuid newuidmap as being privileged?

If you are happy to use the setuid newuidmap (which I believe the unshare
backends for schroot, mmdebstrap, autopkgtest also rely on) then my
understanding is that "rootless" podman is essentially equivalent: you
need a setuid newuidmap, a range of 65536 uids in /etc/subuid, a range of
65536 gids in /etc/subgid, and a kernel that will allow unprivileged
users to create new user namespaces, but beyond that there are no special
privileges required. Please see /usr/share/doc/podman/README.Debian for
details of what it needs.

For systemd-as-pid-1 specifically, `autopkgtest-build-podman
--init=systemd` and `autopkgtest-virt-podman --init` demonstrate how this
can be done, and last time I tried, it was possible to run them
unprivileged (other than needing access to the setuid newuidmap, as
above). systemd is able to detect that it's running in a container and
turn off functionality like udev that would only be appropriate in a VM
or on bare metal, and podman knows how to tell systemd that it should do
this.

    smcv
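
As a footnote to the /etc/subuid and /etc/subgid requirement mentioned
above, here is a rough Python sketch of how one might check it. It
assumes the usual name:start:count line format of those files and treats
"a range of 65536" as a single entry of at least that size; the function
name is hypothetical, and it does not check for the setuid newuidmap or
for the kernel's user namespace settings.

```python
def has_subid_range(text: str, user: str, needed: int = 65536) -> bool:
    """Return True if `text` (the contents of /etc/subuid or /etc/subgid)
    grants `user` at least one range of `needed` or more IDs."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and anything that looks like a comment.
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) != 3:
            continue  # Not in name:start:count form; ignore it.
        name, _start, count = fields
        if name == user and count.isdigit() and int(count) >= needed:
            return True
    return False
```

For a real system this could be called with
Path("/etc/subuid").read_text() and getpass.getuser(), and again with
/etc/subgid.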