Re: Reviving schroot as used by sbuild

Helmut Grohne Tue, 25 Jun 2024 09:57:59 -0700

Hi Simon,

On Tue, Jun 25, 2024 at 02:02:11PM +0100, Simon McVittie wrote:
> Could we use a container framework that is also used outside the Debian
> bubble, rather than writing our own from first principles every time, and
> ending up with a single-maintainer project being load-bearing for Debian
> *again*? I had hoped that after sbuild's history with schroot becoming
> unmaintained, and then being revived by a maintainer-of-last-resort who
> is one of the same few people who are critical-path for various other
> important things, we would recognise that as an anti-pattern that we
> should avoid if we can.

This is a reasonable concern. I contend that while unschroot.py is very
Debian-specific, the underlying plumbing layer is not. I would not have
started working on this if what I wanted to do was doable with existing
code, but maybe the problem was not that the existing code could not do
it, but that I was not using it correctly.

Please allow me to point out that right now, sbuild contains a custom
container framework that is subject to eventually becoming a starving
single-maintainer project and I am trying to extract and separate this
existing container framework from sbuild into more reusable components.
Likewise, mmdebstrap contains another custom container framework that is
similar but not equal to the one in sbuild.

> At the moment, rootless Podman would seem like the obvious choice. As far
> as I'm aware, it has the same user namespaces requirements as the unshare
> backends in mmdebstrap, autopkgtest and schroot (user namespaces enabled,
> setuid newuidmap, 65536 uids in /etc/subuid, 65536 gids in /etc/subgid).

I concur, the privilege requirements for rootless podman are exactly the
ones I am interested in. Indeed, podman was the thing investigated most
thoroughly, but evidently not thoroughly enough.
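
For illustration, the subordinate id part of those requirements can be
checked with a few lines of Python. This is only a sketch under
assumptions: the function names and the example user names are made up,
and it only matches entries by login name (entries keyed by numeric uid
are ignored).

```python
def subid_count(subid_text: str, user: str) -> int:
    """Sum the sizes of all ranges allocated to `user`.

    `subid_text` is the content of /etc/subuid or /etc/subgid, where each
    line has the form name:start:count.
    """
    total = 0
    for line in subid_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) != 3:
            continue  # ignore malformed lines
        name, _start, count = fields
        if name == user and count.isdigit():
            total += int(count)
    return total


def has_enough_subids(subid_text: str, user: str, needed: int = 65536) -> bool:
    """Return True if `user` has at least `needed` subordinate ids."""
    return subid_count(subid_text, user) >= needed


# Hypothetical file content; on a real system read /etc/subuid instead.
example = "helmut:100000:65536\nsimon:165536:65536\n"
print(has_enough_subids(example, "helmut"))  # True
print(has_enough_subids(example, "nobody"))  # False
```

The same check applies to /etc/subgid. It says nothing about whether
user namespaces are enabled or whether newuidmap is setuid.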

> Podman uses the same OCI images as Docker, so it can either pull from a
> trusted OCI registry, or use images that were built by importing a tarball
> generated by e.g. mmdebstrap or sbuild-createchroot. I assume that for
> Debian we would want to do the latter, at least initially, to avoid
> being forced to either trust an external registry like hub.docker.com
> or operate our own.

At least for me, building container images locally is a requirement. I
have no interest in using a container registry. Faidon pointing at
--rootfs goes further in this direction.

> podman is also supported as a backend by autopkgtest-virt-podman, Toolbx
> (podman-toolbox in Debian) and distrobox. autopkgtest's
> autopkgtest-build-podman does not yet support starting from a tarball
> as described above, but it easily could (contributions welcome).

Thank you for pointing at these. I need to familiarize myself with them.

> Or, if Podman is too "not invented here" for Debian's use, using rootless
> lxd/Incus is another option - although that introduces a dependency
> on projects and formats that are rarely used outside the Debian/Ubuntu
> bubble, which risks them becoming another schroot (and also requires us to
> decide whether we follow Canonical's lxd or the community fork Incus
> post-fork, which could get somewhat political).

lxd/incus was also on my list, but my understanding is that they do not
work without their system services at all, and that being able to operate
containers (i.e. being incus-admin or the like) is roughly equivalent to
being full root on the system, defeating the purpose of the exercise. If
anything is "not invented here", that'd be unschroot rather than
podman.

> > There are two approaches to
> > managing an ephemeral build container using namespaces. In one approach,
> > we create a directory hierarchy of a container root filesystem and for
> > each command and hook that we invoke there, we create new namespaces on
> > demand. In particular, there are no background processes when nothing is
> > running in that container and all that remains is its directory
> > hierarchy. Such a container session can easily survive a reboot (unless
> > stored on tmpfs). Both sbuild --chroot-mode=unshare and unschroot.py
> > follow this approach. For comparison, schroot sets up mounts (e.g /proc)
> > when it begins a session and cleans them up when it ends. No such
> > persistent mounts exist in either sbuild --chroot-mode=unshare or
> > unschroot.py.
> 
> Persisting a container root filesystem between multiple operations comes
> with some serious correctness issues if there are "hooks" that can modify
> it destructively on each operation: see <https://bugs.debian.org/499014>
> and <https://bugs.debian.org/994836>. As a result of that, I think the
> only model that should be used in new systems is to have some concept of
> a session (like schroot type=file, but unlike schroot type=directory)
> so that those "hooks" only run once, on session creation, preventing
> them from arbitrarily reverting/overwriting changes that are subsequently
> made by packages installed into the chroot/container (for example dbus'
> creation of the messagebus uid/gid in #499014, and exim4's creation of
> Debian-exim in #994836).

I guess you understood my explanation differently than it was meant.
While the container is persisted into the filesystem, this is being done
for each package build individually. sbuild --chroot-mode=unshare and
unschroot use a tarball as their source and opening the session amounts
to extracting it. At the end of the session, the tree is disposed. The
session concept of schroot is being reused in unschroot and it very much
behaves like a type=file chroot except that you can begin a session,
reboot and continue using it until you end it without requiring a system
service to recover your sessions during boot.
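
To make the session lifecycle concrete, here is a minimal sketch of
"begin session = extract tarball, end session = dispose of the tree".
It is not the unschroot code: begin_session and end_session are names
invented for this example, and it ignores ownership and device nodes,
which a real chroot tarball carries and which need a user namespace to
extract properly.

```python
import shutil
import tarfile
import uuid
from pathlib import Path


def begin_session(tarball: Path, session_base: Path) -> Path:
    """Extract `tarball` into a fresh directory and return that directory."""
    session_dir = session_base / f"session-{uuid.uuid4().hex}"
    session_dir.mkdir(parents=True)
    with tarfile.open(tarball) as tar:
        # Use the "data" extraction filter where this Python provides it;
        # it refuses unsafe members such as absolute paths.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(session_dir, filter="data")
        else:
            tar.extractall(session_dir)
    return session_dir


def end_session(session_dir: Path) -> None:
    """Dispose of the extracted tree."""
    shutil.rmtree(session_dir)
```

Between begin_session and end_session nothing but the directory exists,
so there are no processes or mounts that a reboot could take away.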

The main difference to how everyone else does this is that in a typical
sbuild interaction it will create a new user namespace for every single
command run as part of the session. sbuild issues tens of commands
before launching dpkg-buildpackage and each of them creates new
namespaces in the Linux kernel (all of them using the same uid mappings,
performing the same bind mounts and so on). The most common way to think
of containers is different: You create those namespaces once and reuse
the same namespace kernel objects for multiple commands that are part of
the same session (e.g. installation of build dependencies and
dpkg-buildpackage).
You describe this other approach in more detail:

> I don't know whether creating new namespaces multiple times (but without
> running external integration hooks the second and subsequent times)
> will also lead to practical problems, but I note that outside the Debian
> bubble, everything that enters a new container environment seems to
> operate by creating a process that encapsulates the container, and then
> either letting it run to completion interactively or non-interactively
> (`docker run`, etc.), or letting it run in the background (perhaps with
> an init system or `sleep infinity` as its "payload" process) and then
> repeatedly injecting code into that pre-existing namespace
> (either `docker exec`, etc., or something like ssh).

Exactly, this is how everyone but sbuild --chroot-mode=unshare and
unschroot do it.
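
The difference between the two models shows up in the commands a driver
would issue. The following sketch only builds argument vectors and runs
nothing. It is an approximation under assumptions: sbuild and unschroot
set up their namespaces themselves rather than calling util-linux
unshare, --map-root-user maps a single uid instead of the 65536 id
range, and the container name and root directory are placeholders.

```python
def per_command_argvs(rootdir, commands):
    """One fresh set of namespaces per command (util-linux unshare)."""
    prefix = [
        "unshare", "--user", "--map-root-user", "--mount", "--pid",
        "--fork", f"--root={rootdir}",
    ]
    return [prefix + list(cmd) for cmd in commands]


def session_argvs(name, rootdir, commands):
    """One long-lived container; commands are injected with podman exec."""
    argvs = [["podman", "run", "--detach", "--name", name,
              "--rootfs", rootdir, "sleep", "infinity"]]
    argvs += [["podman", "exec", name] + list(cmd) for cmd in commands]
    argvs.append(["podman", "rm", "--force", name])
    return argvs


commands = [["apt-get", "update"], ["dpkg-buildpackage", "-us", "-uc"]]
for argv in per_command_argvs("/tmp/session", commands):
    print(" ".join(argv))
for argv in session_argvs("build1", "/tmp/session", commands):
    print(" ".join(argv))
```

In the first model every line is independent of the others. In the
second, the exec lines only work while the process started by the run
line is still alive.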

> autopkgtest's Docker, Podman, lxc, lxd backends all operate by creating
> a namespaced init or sleep process with `docker run` or equivalent, and
> then injecting subsequent commands into the namespace that was created
> for that long-running process with `docker exec` or equivalent.

Please allow me a tangential excursion here. There are two ways of
interacting with containers that use one set of namespaces for their
entire existence. One is to set up some IPC mechanism and receive
commands to be run inside (for instance spawning a shell and piping
commands into it or driving the container via ssh). The other is for an
external process to join (setns) the existing container (namespaces) and
inject code into it (docker exec). That latter approach has a history of
vulnerabilities closely related to vulnerabilities in setuid binaries,
because we are transitioning a process (and all of its context) from
outside the container into it and thus expose all of its context (memory
maps, open file descriptors and so on) to contained processes. As such,
I think that an approach based on an IPC mechanism should be preferred.
I am not sure whether podman exec operates in this way, but a quick
codesearch did not turn up obvious uses of setns inside the podman
source code. Would anyone be able to tell how podman exec is
implemented here?
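
As a rough illustration of the IPC variant (spawning a shell and piping
commands into it), consider the sketch below. The function name and the
shell_argv parameter are invented for this example. Here the shell is a
plain local sh; in a real setup shell_argv would be whatever starts the
shell inside the container, once. For simplicity all commands are sent
in one go rather than interactively.

```python
import subprocess


def run_via_pipe(commands, shell_argv=("sh",)):
    """Feed `commands` to a shell over its stdin and return its stdout.

    The calling process never joins the shell's namespaces; it only
    talks to it through pipes.
    """
    script = "".join(cmd + "\n" for cmd in commands)
    result = subprocess.run(
        list(shell_argv),
        input=script,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


print(run_via_pipe(["echo hello", "echo world"]), end="")
```

The point is that only data crosses the boundary. No process with
outside context (memory maps, open file descriptors) is moved into the
container.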

> I think unshare is the outlier here, and I think it would be good to
> consider whether it really needs to be.

Absolutely! Did you observe that I suggested moving unschroot to that
other model where the namespace objects are reused for the entire
session? Indeed, moving sbuild --chroot-mode=unshare in this direction
was one of the primary motivations for starting this work, but doing
this inside sbuild is very difficult due to its architecture, so my
approach was first separating the container framework from sbuild and
that's how I arrived at unschroot.

> The more like other container managers a new container manager is, the
> less likely it is to break reasonable expectations in future, like
> schroot regularly does.

Yes! I very much used the systemd container interface documentation to
avoid exactly this problem.

> > While podman
> > and docker allow running unprivileged application containers, they still
> > require privileged containers when you want to run systemd-as-pid-1.
> 
> What do you mean by "privileged containers" exactly? Do you mean a system
> service that runs with CAP_SYS_ADMIN and other scary privileges in the
> init namespace, like the typical use of dockerd, or are you also counting
> uses of the setuid newuidmap as being privileged?

I'm sorry for being imprecise here. Privileged is an overloaded term in
the container context. I was trying to use it with the "not rootless"
meaning here. The interest is in running containers with user privileges
available on common installations (i.e. unprivileged user namespaces,
newuidmap being setuid, subuid allocation and systemd being your cgroup
manager and handing out delegated cgroups).

> If you are happy to use the setuid newuidmap (which I believe the unshare
> backends for schroot, mmdebstrap, autopkgtest also rely on) then my
> understanding is that "rootless" podman is essentially equivalent:
> you need a setuid newuidmap, a range of 65536 uids in /etc/subuid,
> a range of 65536 gids in /etc/subgid, and a kernel that will allow
> unprivileged users to create new user namespaces, but beyond that there
> are no special privileges required.

Cool. I think you really need one more non-trivial (but very commonly
available) privilege. You need a cgroup manager (such as systemd) that
allows creating and delegating a cgroup hierarchy to you. You may call
this a non-special privilege.

> Please see /usr/share/doc/podman/README.Debian for details of what it needs.

It could use updating as swapaccount=1 is the default.

> For systemd-as-pid-1 specifically,
> `autopkgtest-build-podman --init=systemd` and
> `autopkgtest-virt-podman --init` demonstrate how this can be done, and
> last time I tried, it was possible to run them unprivileged (other than
> needing access to the setuid newuidmap, as above). systemd is able to
> detect that it's running in a container and turn off functionality like
> udev that would only be appropriate in a VM or on bare metal, and podman
> knows how to tell systemd that it should do this.

This is very cool. Running autopkgtests in system containers without
being root (or incus-admin) very much is what I'd like to do. And it's
much better if I don't have to write my own container framework for
doing it. I couldn't get it to work locally yet (facing non-obvious
error messages).

Would someone be able to document (mail/wiki/blog/...) how to set up and
use podman for running autopkgtests? Thus far, I failed to figure out
how to plug a local Debian mirror (as opposed to a container registry)
into autopkgtest-build-podman. It is quite difficult to locate podman
documentation that is applicable under the assumption that you don't
want to use any container registry.

So thank you very much for pointing me hard at podman again. My podman
research dates back quite a bit and I can already tell that podman is
quite a bit different now.

Let me circle back to the question of whether podman solves the needs of
sbuild. We learned that sbuild --chroot-mode=unshare and unschroot spawn
a new set of namespaces for every command. What you point out as a
limitation also is a feature. Technically, it is a lie that the
namespaces are always constructed in the same way. During installation
of build dependencies the network namespace is not unshared, while
package builds commonly use an unshared network namespace with no
interfaces but
the loopback interface. In a similar vein, constructing a pid namespace
for every command ensures reliable process cleanup: Once your build has
exited, all background processes are reliably disposed. These aspects
are very useful to how we use containers in sbuild, but the way most
container runtimes work with a single set of namespaces makes this
non-trivial. We really want to change the set of namespaces throughout
the session.
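
Expressed as code, the per-phase choice of namespaces could look like
the sketch below. The phase names and the function are hypothetical and
not taken from sbuild or unschroot. The flag values are the ones from
<linux/sched.h>.

```python
# Namespace flags as defined in <linux/sched.h>.
CLONE_NEWNS = 0x00020000    # mount namespace
CLONE_NEWUSER = 0x10000000  # user namespace
CLONE_NEWPID = 0x20000000   # pid namespace
CLONE_NEWNET = 0x40000000   # network namespace

# Every command gets fresh user, mount and pid namespaces.
BASE_FLAGS = CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID


def namespace_flags(phase: str) -> int:
    """Return the unshare flags for a (hypothetical) session phase."""
    if phase == "install-build-deps":
        # Needs the host network to reach the mirror.
        return BASE_FLAGS
    if phase == "build":
        # Isolated network: only a loopback interface remains.
        return BASE_FLAGS | CLONE_NEWNET
    raise ValueError(f"unknown phase: {phase}")


for phase in ("install-build-deps", "build"):
    print(phase, hex(namespace_flags(phase)))
```

With a single long-lived set of namespaces there is no natural place to
make this per-command decision, which is the difficulty described above.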

So I think the requirements of sbuild (and piuparts) for container
frameworks are quite specific and not easily met by existing tools.
Ultimately, this is what led me into writing a reusable Python module
providing
container plumbing and a relatively thin implementation of schroot using
namespaces on top of it.

If we can get the requested features from podman, it is the better
choice to me for the maintainability reasons that you started
with. It is not clear though whether podman can be made to address our
requirements.

Thank you for having taken one step back and questioning my context
instead of going into my actual questions.

Helmut
