Hi Simon,

On Tue, Jun 25, 2024 at 02:02:11PM +0100, Simon McVittie wrote:
> Could we use a container framework that is also used outside the Debian
> bubble, rather than writing our own from first principles every time, and
> ending up with a single-maintainer project being load-bearing for Debian
> *again*? I had hoped that after sbuild's history with schroot becoming
> unmaintained, and then being revived by a maintainer-of-last-resort who
> is one of the same few people who are critical-path for various other
> important things, we would recognise that as an anti-pattern that we
> should avoid if we can.

This is a reasonable concern. I contend that while unschroot.py is very Debian-specific, the underlying plumbing layer is not. I would not have started working on this if what I wanted to do was doable with existing code, but maybe the problem was not that the existing code could not do it, but that I was not using it correctly.

Please allow me to point out that right now, sbuild contains a custom container framework that is at risk of eventually becoming a starving single-maintainer project, and I am trying to extract this existing container framework from sbuild into more reusable components. Likewise, mmdebstrap contains another custom container framework that is similar but not identical to the one in sbuild.

> At the moment, rootless Podman would seem like the obvious choice. As far
> as I'm aware, it has the same user namespaces requirements as the unshare
> backends in mmdebstrap, autopkgtest and schroot (user namespaces enabled,
> setuid newuidmap, 65536 uids in /etc/subuid, 65536 gids in /etc/subgid).

I concur, the privilege requirements for rootless podman are exactly the ones I am interested in. Indeed, podman was the option I investigated most thoroughly, but evidently not thoroughly enough.

> Podman uses the same OCI images as Docker, so it can either pull from a
> trusted OCI registry, or use images that were built by importing a tarball
> generated by e.g. mmdebstrap or sbuild-createchroot. I assume that for
> Debian we would want to do the latter, at least initially, to avoid
> being forced to either trust an external registry like hub.docker.com
> or operate our own.

At least for me, building container images locally is a requirement. I have no interest in using a container registry. Faidon's pointer to --rootfs goes further in this direction.

> podman is also supported as a backend by autopkgtest-virt-podman, Toolbx
> (podman-toolbox in Debian) and distrobox. autopkgtest's
> autopkgtest-build-podman does not yet support starting from a tarball
> as described above, but it easily could (contributions welcome).

Thank you for pointing at these. I need to familiarize myself with them.

> Or, if Podman is too "not invented here" for Debian's use, using rootless
> lxd/Incus is another option - although that introduces a dependency
> on projects and formats that are rarely used outside the Debian/Ubuntu
> bubble, which risks them becoming another schroot (and also requires us to
> decide whether we follow Canonical's lxd or the community fork Incus
> post-fork, which could get somewhat political).

lxd/incus was also on my list, but my understanding is that they do not work at all without their system services, and being able to operate containers (i.e. being incus-admin or the like) is roughly equivalent to being full root on the system, which defeats the purpose of the exercise. If anything is "not invented here", that'd be unschroot rather than podman.

> > There are two approaches to
> > managing an ephemeral build container using namespaces. In one approach,
> > we create a directory hierarchy of a container root filesystem and for
> > each command and hook that we invoke there, we create new namespaces on
> > demand. In particular, there are no background processes when nothing is
> > running in that container and all that remains is its directory
> > hierarchy. Such a container session can easily survive a reboot (unless
> > stored on tmpfs). Both sbuild --chroot-mode=unshare and unschroot.py
> > follow this approach. For comparison, schroot sets up mounts (e.g /proc)
> > when it begins a session and cleans them up when it ends. No such
> > persistent mounts exist in either sbuild --chroot-mode=unshare or
> > unschroot.py.
>
> Persisting a container root filesystem between multiple operations comes
> with some serious correctness issues if there are "hooks" that can modify
> it destructively on each operation: see <https://bugs.debian.org/499014>
> and <https://bugs.debian.org/994836>.
> As a result of that, I think the
> only model that should be used in new systems is to have some concept of
> a session (like schroot type=file, but unlike schroot type=directory)
> so that those "hooks" only run once, on session creation, preventing
> them from arbitrarily reverting/overwriting changes that are subsequently
> made by packages installed into the chroot/container (for example dbus'
> creation of the messagebus uid/gid in #499014, and exim4's creation of
> Debian-exim in #994836).

I guess you understood my explanation differently from how it was meant. While the container is persisted to the filesystem, this is done for each package build individually. sbuild --chroot-mode=unshare and unschroot use a tarball as their source, and opening the session amounts to extracting it. At the end of the session, the tree is disposed of. The session concept of schroot is reused in unschroot, and it behaves very much like a type=file chroot, except that you can begin a session, reboot and continue using it until you end it, without requiring a system service to recover your sessions during boot.

The main difference from how everyone else does this is that a typical sbuild interaction creates a new user namespace for every single command run as part of the session. sbuild issues tens of commands before launching dpkg-buildpackage, and each of them creates new namespaces in the Linux kernel (all of them using the same uid mappings, performing the same bind mounts and so on). The most common way to think of containers is different: you create those namespaces once and reuse the same namespace kernel objects for multiple commands that are part of the same session (e.g. installation of build dependencies and dpkg-buildpackage).
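
To make the contrast concrete, here is a minimal Python sketch of the two models. It only constructs command lines and does not run anything. Wrapping commands in unshare(1) and chroot(8), the particular flag set and the function names are assumptions for illustration; neither sbuild nor unschroot is claimed to work this way, and a real setup would additionally map a subuid range and perform bind mounts.

```python
import shlex

# Hypothetical flag set for illustration only; sbuild and unschroot do not
# necessarily invoke the unshare(1) utility at all.
NAMESPACE_FLAGS = ["--user", "--map-root-user", "--mount", "--pid", "--fork"]


def wrap_in_new_namespaces(rootdir, command):
    """Build an argv that runs one command in freshly created namespaces."""
    return ["unshare", *NAMESPACE_FLAGS, "chroot", rootdir, *command]


def per_command_plan(rootdir, commands):
    """Per-command model: one new set of namespaces for every command."""
    return [wrap_in_new_namespaces(rootdir, command) for command in commands]


def single_namespace_plan(rootdir, commands):
    """Session model: create the namespaces once and run all commands inside."""
    script = " && ".join(shlex.join(command) for command in commands)
    return [wrap_in_new_namespaces(rootdir, ["sh", "-c", script])]
```

For a session consisting of `apt-get update` followed by `dpkg-buildpackage -us`, `per_command_plan` yields two invocations (two sets of namespaces), whereas `single_namespace_plan` yields one.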

You describe this other approach in more detail:

> I don't know whether creating new namespaces multiple times (but without
> running external integration hooks the second and subsequent times)
> will also lead to practical problems, but I note that outside the Debian
> bubble, everything that enters a new container environment seems to
> operate by creating a process that encapsulates the container, and then
> either letting it run to completion interactively or non-interactively
> (`docker run`, etc.), or letting it run in the background (perhaps with
> an init system or `sleep infinity` as its "payload" process) and then
> repeatedly injecting code into that pre-existing namespace
> (either `docker exec`, etc., or something like ssh).

Exactly, this is how everyone but sbuild --chroot-mode=unshare and unschroot does it.

> autopkgtest's Docker, Podman, lxc, lxd backends all operate by creating
> a namespaced init or sleep process with `docker run` or equivalent, and
> then injecting subsequent commands into the namespace that was created
> for that long-running process with `docker exec` or equivalent.

Please allow me a tangential excursion here. There are two ways of interacting with containers that use one set of namespaces for their entire existence. One is to set up some IPC mechanism and receive commands to be run inside (for instance, spawning a shell and piping commands into it, or driving the container via ssh). The other is for an external process to join (setns) the existing container's namespaces and inject code into it (docker exec). The latter approach has a history of vulnerabilities closely related to those in setuid binaries, because we are transitioning a process (and all of its context) from outside the container into it and thus expose that context (memory maps, open file descriptors and so on) to contained processes. As such, I think that an approach based on an IPC mechanism should be preferred.
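
The "spawn a shell and pipe commands into it" variant of the IPC approach can be sketched in a few lines of Python. This is an illustration under assumptions, not how any of the mentioned tools is implemented: the class name `ShellSession` and the end-of-command marker are made up, and a plain local `sh` stands in for a shell that would be started inside the container's namespaces.

```python
import subprocess


class ShellSession:
    """Keep one long-running shell and pipe commands into it."""

    def __init__(self, argv=("sh",)):
        # In a real setup, argv would start the shell inside the container
        # namespaces; a plain local "sh" is used here as a stand-in.
        self.proc = subprocess.Popen(
            list(argv), stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )

    def run(self, command, marker="__done__"):
        """Send one command line; return (exit status, captured stdout).

        Simplification: assumes the command's output ends with a newline
        and never contains a line starting with the marker.
        """
        self.proc.stdin.write(f"{command}\necho {marker} $?\n")
        self.proc.stdin.flush()
        lines = []
        for line in iter(self.proc.stdout.readline, ""):
            if line.startswith(marker):
                return int(line.split()[1]), "".join(lines)
            lines.append(line)
        raise RuntimeError("shell exited before the command finished")

    def close(self):
        """Close the command pipe and wait for the shell to exit."""
        self.proc.stdin.close()
        return self.proc.wait()
```

No outside process ever enters the container here: only command text crosses the pipe, and shell state such as variables or the working directory persists between `run()` calls.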

I am not sure whether podman exec joins the container's namespaces via setns in this way, but a quick codesearch did not reveal obvious uses of setns inside the podman source code. Would anyone be able to tell how podman exec is implemented?

> I think unshare is the outlier here, and I think it would be good to
> consider whether it really needs to be.

Absolutely! Did you observe that I suggested moving unschroot to that other model where the namespace objects are reused for the entire session? Indeed, moving sbuild --chroot-mode=unshare in this direction was one of the primary motivations for starting this work, but doing this inside sbuild is very difficult due to its architecture, so my approach was to first separate the container framework from sbuild, and that's how I arrived at unschroot.

> The more like other container managers a new container manager is, the
> less likely it is to break reasonable expectations in future, like
> schroot regularly does.

Yes! I relied heavily on the systemd container interface documentation to avoid exactly this problem.

> > While podman
> > and docker allow running unprivileged application containers, they still
> > require privileged containers when you want to run systemd-as-pid-1.
>
> What do you mean by "privileged containers" exactly? Do you mean a system
> service that runs with CAP_SYS_ADMIN and other scary privileges in the
> init namespace, like the typical use of dockerd, or are you also counting
> uses of the setuid newuidmap as being privileged?

I'm sorry for being imprecise here. Privileged is an overloaded term in the container context. I was using it in the "not rootless" sense. My interest is in running containers with the user privileges available on common installations (i.e. unprivileged user namespaces, newuidmap being setuid, subuid allocation, and systemd being your cgroup manager and handing out delegated cgroups).
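
One of these prerequisites, the subuid allocation, is easy to check mechanically. The following Python sketch parses text in the `name:start:count` format of /etc/subuid (or /etc/subgid) and checks for a range of at least 65536 ids, the figure quoted earlier in this thread. The function names and the user name in the usage note are hypothetical, and the sketch does not check the other prerequisites (setuid newuidmap, unprivileged user namespaces, cgroup delegation).

```python
def subid_ranges(text, user):
    """Parse /etc/subuid or /etc/subgid content into (start, count) pairs."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and anything that looks like a comment
        name, start, count = line.split(":")
        if name == user:
            ranges.append((int(start), int(count)))
    return ranges


def has_enough_subids(text, user, needed=65536):
    """Return True if one range for the user spans at least `needed` ids."""
    return any(count >= needed for _, count in subid_ranges(text, user))
```

For example, given the content `builder:100000:65536`, `has_enough_subids(content, "builder")` is true, while a user with only a 1000-id range or no entry at all fails the check.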

> If you are happy to use the setuid newuidmap (which I believe the unshare
> backends for schroot, mmdebstrap, autopkgtest also rely on) then my
> understanding is that "rootless" podman is essentially equivalent:
> you need a setuid newuidmap, a range of 65536 uids in /etc/subuid,
> a range of 65536 gids in /etc/subgid, and a kernel that will allow
> unprivileged users to create new user namespaces, but beyond that there
> are no special privileges required.

Cool. I think you really need one more non-trivial (but very commonly available) privilege: a cgroup manager (such as systemd) that allows creating and delegating a cgroup hierarchy to you. You may call this a non-special privilege.

> Please see /usr/share/doc/podman/README.Debian for details of what it needs.

It could use updating, as swapaccount=1 is the default.

> For systemd-as-pid-1 specifically,
> `autopkgtest-build-podman --init=systemd` and
> `autopkgtest-virt-podman --init` demonstrate how this can be done, and
> last time I tried, it was possible to run them unprivileged (other than
> needing access to the setuid newuidmap, as above). systemd is able to
> detect that it's running in a container and turn off functionality like
> udev that would only be appropriate in a VM or on bare metal, and podman
> knows how to tell systemd that it should do this.

This is very cool. Running autopkgtests in system containers without being root (or incus-admin) is very much what I'd like to do. And it's much better if I don't have to write my own container framework for doing it. I couldn't get it to work locally yet (facing non-obvious error messages). Would someone be able to document (mail/wiki/blog/...) how to set up and use podman for running autopkgtests? Thus far, I have failed to figure out how to plug a local Debian mirror (as opposed to a container registry) into autopkgtest-build-podman.
It is quite difficult to locate podman documentation that is applicable under the assumption that you don't want to use any container registry.

So thank you very much for pointing me hard at podman again. My podman research dates back quite a bit and I can already tell that podman is quite a bit different now.

Let me circle back to the question of whether podman solves the needs of sbuild. We learned that sbuild --chroot-mode=unshare and unschroot spawn a new set of namespaces for every command. What you point out as a limitation is also a feature. Technically, it is a lie that the namespaces are always constructed in the same way. During installation of build dependencies, the network namespace is not unshared, while package builds commonly use an unshared network namespace with no interfaces but the loopback interface. In a similar vein, constructing a pid namespace for every command ensures reliable process cleanup: once your build has exited, all background processes are reliably disposed of. These aspects are very useful to how we use containers in sbuild, but the way most container runtimes work with a single set of namespaces makes this non-trivial. We really want to change the set of namespaces throughout the session.

So I think the needs of sbuild (and piuparts) regarding container frameworks are quite specific and not easily met by existing tools. Ultimately, this is what led me to write a reusable Python module providing container plumbing and a relatively thin implementation of schroot using namespaces on top of it. If we can get the requested features from podman, it is the better choice to me for the maintainability reasons that you started with. It is not clear, though, whether podman can be made to address our requirements.

Thank you for having taken one step back and questioning my context instead of going into my actual questions.

Helmut