On Tue, 25 Jun 2024 at 10:16:20 +0200, Helmut Grohne wrote:
> In this work, limitations with --chroot-mode=unshare became apparent and
> that lead to Johannes, Jochen and me sitting down in Berlin pondering
> ideas on how to improve the situation. That is a longer story, but
> eventually Timo Röhling asked the innocuous question of why we cannot
> just use schroot and make it work with namespaces.

I have to ask:

Could we use a container framework that is also used outside the Debian
bubble, rather than writing our own from first principles every time, and
ending up with a single-maintainer project being load-bearing for Debian
*again*? I had hoped that after sbuild's history with schroot becoming
unmaintained, and then being revived by a maintainer-of-last-resort who
is one of the same few people who are critical-path for various other
important things, we would recognise that as an anti-pattern that we
should avoid if we can.

At the moment, rootless Podman would seem like the obvious choice. As far
as I'm aware, it has the same user namespace requirements as the unshare
backends in mmdebstrap, autopkgtest and schroot (user namespaces enabled,
setuid newuidmap, 65536 uids in /etc/subuid, 65536 gids in /etc/subgid).
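
A rough way to check those prerequisites on a given machine is sketched
below. It is only an approximation: the helper name subid_count is made up
for this example, it only matches /etc/subuid entries keyed by user name
(not by numeric uid), and it assumes newuidmap is setuid as on Debian
rather than carrying file capabilities.

```shell
#!/bin/sh
# Sum the sub-id counts allocated to a user in a subuid/subgid-format file
# (lines of the form name:start:count). Hypothetical helper for this sketch.
subid_count() {  # usage: subid_count FILE USER
    awk -F: -v u="$2" '$1 == u { n += $3 } END { print n + 0 }' "$1"
}

user="$(id -un)"

# 65536 uids in /etc/subuid and 65536 gids in /etc/subgid
for f in /etc/subuid /etc/subgid; do
    if [ -r "$f" ] && [ "$(subid_count "$f" "$user")" -ge 65536 ]; then
        echo "$f: ok"
    else
        echo "$f: fewer than 65536 ids for $user"
    fi
done

# setuid newuidmap
newuidmap_path="$(command -v newuidmap || true)"
if [ -n "$newuidmap_path" ] && [ -u "$newuidmap_path" ]; then
    echo "newuidmap: setuid"
else
    echo "newuidmap: missing or not setuid"
fi

# user namespaces enabled (a limit of 0 disables them)
if [ -r /proc/sys/user/max_user_namespaces ] &&
   [ "$(cat /proc/sys/user/max_user_namespaces)" -gt 0 ]; then
    echo "user namespaces: enabled"
else
    echo "user namespaces: disabled or unknown"
fi
```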

Podman uses the same OCI images as Docker, so it can either pull from a
trusted OCI registry, or use images that were built by importing a tarball
generated by e.g. mmdebstrap or sbuild-createchroot. I assume that for
Debian we would want to do the latter, at least initially, to avoid
being forced to either trust an external registry like hub.docker.com
or operate our own.

Here's the Dockerfile/Containerfile to turn a sysroot tarball into an
OCI image (obviously it can be extended with LABELs and other
customizations, but this is fairly close to minimal):

    FROM scratch
    ADD sysroot.tar.gz /
    CMD ["/bin/bash"]
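
For illustration, the whole round trip from nothing to a local image might
look roughly like this. It is a sketch, not a tested recipe: the image name
localhost/debian-sysroot and the SUITE, IMAGE and RUN_BUILD variables are
invented for this example, and the mmdebstrap and podman steps only run
when RUN_BUILD=1 is set.

```shell
#!/bin/sh
set -eu

# Hypothetical names, adjust to taste.
suite="${SUITE:-unstable}"
image="${IMAGE:-localhost/debian-sysroot:$suite}"
workdir="$(mktemp -d)"

# Write the three-line Containerfile shown above into the build context.
cat > "$workdir/Containerfile" <<'EOF'
FROM scratch
ADD sysroot.tar.gz /
CMD ["/bin/bash"]
EOF

# Only call mmdebstrap and podman when explicitly asked to.
if [ "${RUN_BUILD:-0}" = 1 ]; then
    # Create the sysroot tarball next to the Containerfile.
    mmdebstrap --variant=buildd "$suite" "$workdir/sysroot.tar.gz"
    # Build the OCI image from the context directory.
    podman build --tag "$image" "$workdir"
    # Smoke test: run one command in a throwaway container.
    podman run --rm "$image" cat /etc/debian_version
fi
```

`podman import sysroot.tar.gz NAME` should give a similar result without a
Containerfile, at the cost of not being able to set CMD or LABELs in the
same file.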

The reason I suggest Podman rather than Docker is that Podman is normally
"daemonless" (the container is an ordinary process tree, like schroot,
rather than being launched by command-execution RPC to dockerd) and
is normally used "rootless" (whereas Docker *can* be configured to be
"rootless" but in practice it seems that's very uncommon).

Podman is also supported as a backend by autopkgtest-virt-podman, Toolbx
(podman-toolbox in Debian) and distrobox. autopkgtest's
autopkgtest-build-podman does not yet support starting from a tarball
as described above, but it easily could (contributions welcome).

Or, if Podman is too "not invented here" for Debian's use, using rootless
lxd/Incus is another option - although that introduces a dependency
on projects and formats that are rarely used outside the Debian/Ubuntu
bubble, which risks them becoming another schroot (and also requires us to
decide whether we follow Canonical's lxd or the community fork Incus
post-fork, which could get somewhat political).

> There are two approaches to
> managing an ephemeral build container using namespaces. In one approach,
> we create a directory hierarchy of a container root filesystem and for
> each command and hook that we invoke there, we create new namespaces on
> demand. In particular, there are no background processes when nothing is
> running in that container and all that remains is its directory
> hierarchy. Such a container session can easily survive a reboot (unless
> stored on tmpfs). Both sbuild --chroot-mode=unshare and unschroot.py
> follow this approach. For comparison, schroot sets up mounts (e.g /proc)
> when it begins a session and cleans them up when it ends. No such
> persistent mounts exist in either sbuild --chroot-mode=unshare or
> unschroot.py.

Persisting a container root filesystem between multiple operations comes
with some serious correctness issues if there are "hooks" that can modify
it destructively on each operation: see <https://bugs.debian.org/499014>
and <https://bugs.debian.org/994836>. As a result of that, I think the
only model that should be used in new systems is to have some concept of
a session (like schroot type=file, but unlike schroot type=directory)
so that those "hooks" only run once, on session creation, preventing
them from arbitrarily reverting/overwriting changes that are subsequently
made by packages installed into the chroot/container (for example dbus'
creation of the messagebus uid/gid in #499014, and exim4's creation of
Debian-exim in #994836).

I don't know whether creating new namespaces multiple times (but without
running external integration hooks the second and subsequent times)
will also lead to practical problems, but I note that outside the Debian
bubble, everything that enters a new container environment seems to
operate by creating a process that encapsulates the container, and then
either letting it run to completion interactively or non-interactively
(`docker run`, etc.), or letting it run in the background (perhaps with
an init system or `sleep infinity` as its "payload" process) and then
repeatedly injecting code into that pre-existing namespace
(either `docker exec`, etc., or something like ssh).

autopkgtest's Docker, Podman, lxc and lxd backends all operate by creating
a namespaced init or sleep process with `docker run` or equivalent, and
then injecting subsequent commands into the namespace that was created
for that long-running process with `docker exec` or equivalent.
I think unshare is the outlier here, and I think it would be good to
consider whether it really needs to be.
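
To make that run-then-exec pattern concrete, here is a minimal sketch with
Podman. The session_begin, session_run and session_end function names and
the PODMAN variable are invented for this example and are not taken from
autopkgtest or sbuild; only the podman subcommands and options are real.

```shell
#!/bin/sh
# Container manager to call; overridable so the functions can be dry-run.
PODMAN="${PODMAN:-podman}"

# Start a long-running "payload" process that holds the namespaces open.
session_begin() {  # usage: session_begin NAME IMAGE
    $PODMAN run --detach --name "$1" "$2" sleep infinity
}

# Inject a command into the namespaces of the already-running container.
session_run() {  # usage: session_run NAME COMMAND [ARG...]
    name="$1"
    shift
    $PODMAN exec "$name" "$@"
}

# Kill the payload process and discard the container.
session_end() {  # usage: session_end NAME
    $PODMAN rm --force "$1"
}
```

A build would then be something like `session_begin build1 IMAGE`, any
number of `session_run build1 ...` calls, and finally `session_end build1`.
Any setup happens once, when the session begins, rather than on every
command.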

The more like other container managers a new container manager is, the
less likely it is to break reasonable expectations in future, like
schroot regularly does.

> While podman
> and docker allow running unprivileged application containers, they still
> require privileged containers when you want to run systemd-as-pid-1.

What do you mean by "privileged containers" exactly? Do you mean a system
service that runs with CAP_SYS_ADMIN and other scary privileges in the
init namespace, like the typical use of dockerd, or are you also counting
uses of the setuid newuidmap as being privileged?

If you are happy to use the setuid newuidmap (which I believe the unshare
backends for schroot, mmdebstrap, autopkgtest also rely on) then my
understanding is that "rootless" podman is essentially equivalent:
you need a setuid newuidmap, a range of 65536 uids in /etc/subuid,
a range of 65536 gids in /etc/subgid, and a kernel that will allow
unprivileged users to create new user namespaces, but beyond that there
are no special privileges required.

Please see /usr/share/doc/podman/README.Debian for details of what it needs.

For systemd-as-pid-1 specifically,
`autopkgtest-build-podman --init=systemd` and
`autopkgtest-virt-podman --init` demonstrate how this can be done, and
last time I tried, it was possible to run them unprivileged (other than
needing access to the setuid newuidmap, as above). systemd is able to
detect that it's running in a container and turn off functionality like
udev that would only be appropriate in a VM or on bare metal, and podman
knows how to tell systemd that it should do this.

    smcv
