On Tue, 25 Jun 2024 at 18:55:45 +0200, Helmut Grohne wrote:
> At least for me, building container images locally is a requirement. I
> have no interest in using a container registry.

I expected you'd say that. podman's --rootfs option is one way to use it
without a registry; a trivially short Dockerfile like the one I mentioned,
converting a tarball into a container image locally, is another.
(Debian's pseudo-official Docker images on Docker Hub use the latter.)
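
For the --rootfs route, it can be as simple as this (a sketch, assuming
you already have an extracted minbase tree at ~/sid-rootfs):

    # run a shell in an already-extracted directory tree, with no
    # image or registry involved at all
    $ podman run --rm -it --rootfs ~/sid-rootfs /bin/bash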

But I think it would be great if some part of Debian - perhaps the
cloud team? - could periodically publish genuinely official minbase
sysroot tarballs and/or OCI images from Debian infrastructure, as the
cloud team already does for VM images. That would avoid relying on a
third-party registry, and would also avoid requiring every developer to
spend thought and CPU time on building their own image before they can
start on their actual development.

> lxd/incus also was on my list, but my understanding is that they do not
> work without their system services at all and being able to operate
> containers (i.e. being incus-admin or the like) roughly becomes
> equivalent to being full root on the system defeating the purpose of the
> exercise.

Perhaps; I haven't looked into lxd/incus in detail (podman seems to have
the properties I wanted, so I stopped there). I might have been misled by
the fact that lxd can run rootless containers - but maybe it can only
do that by making IPC requests to a privileged service, a bit like the
way snapd operates.

> I guess you understood my explanation differently than it was meant.
> While the container is persisted into the filesystem, this is being done
> for each package build individually. sbuild --chroot-mode=unshare and
> unschroot use a tarball as their source and opening the session amounts
> to extracting it. At the end of the session, the tree is disposed. The
> session concept of schroot is being reused in unschroot and it very much
> behaves like a type=file chroot except that you can begin a session,
> reboot and continue using it until you end it without requiring a system
> service to recover your sessions during boot.

OK, good: this is "the same shape" as schroot type=file, which is not
one of the modes that has the problems I described. If you're carrying
over the underlying on-disk directory across reboots, you'll have to
be a little careful about persisting state into that directory (only
things that will still be true after a reboot can safely be stored),
but I'm sure you're doing that.

> The main difference to how everyone else does this is that in a typical
> sbuild interaction it will create a new user namespace for every single
> command run as part of the session. sbuild issues tens of commands
> before launching dpkg-buildpackage and each of them creates new
> namespaces in the Linux kernel (all of them using the same uid mappings,
> performing the same bind mounts and so on). The most common way to think
> of containers is different: You create those namespaces once and reuse
> the same namespace kernel objects for multiple commands part of the same
> session (e.g. installation of build dependencies and dpkg-buildpackage).

Yes. My concern here is that there might be non-obvious reasons why
everyone else is doing this the other way, which could lead to behavioural
differences between unschroot and all the others that will come back to
bite us later.

> There are two ways of
> interacting with containers that use one set of namespaces for their
> entire existence. One is setting up some IPC mechanism and receiving
> commands to be run inside (for instance spawning a shell and piping
> commands into it or driving the container via ssh); the other is having
> an external process join (setns) the existing container (namespaces)
> and inject code into it (docker exec). The latter approach has a
> history of
> vulnerabilities closely related to vulnerabilities in setuid binaries,
> because we are transitioning a process (and all of its context) from
> outside the container into it and thus expose all of its context (memory
> maps, open file descriptors and so on) to contained processes. As such,
> I think that an approach based on an IPC mechanism should be preferred.

An IPC-based approach is certainly going to provide better security
hardening than setns-style injection (especially where setuid helpers
are involved), and potentially better functionality as well.
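
As a minimal illustration of the IPC direction (only a sketch, not how
any of these tools actually implement it; ~/sid-rootfs is again an
assumed pre-extracted tree): keep one long-lived shell inside the
namespaces and feed it commands from outside, so that no process ever
has to transition into the container:

    # create the namespaces once, with a shell reading from a pipe
    $ mkfifo control
    $ unshare --map-root-user --fork --pid --mount-proc \
          chroot ~/sid-rootfs /bin/sh < control &
    $ exec 3> control            # hold the writing end open
    $ echo 'apt-get update' >&3  # executed inside the container
    $ echo 'exit' >&3            # end the session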

In Flatpak (which uses namespaces too, but is not really the same sort
of container), the debugging command `flatpak enter` currently uses the
setns approach (which comes with various limitations), and one of the
items on my infinite to-do list is to make that be IPC-based instead,
possibly by reusing code written for steam-runtime-tools during $dayjob.

For whole-system containers running an OS image from init upwards,
or for virtual machines, using ssh as the IPC mechanism seems
pragmatic. Recent versions of systemd can even be given an ssh public
key via the systemd.system-credentials(7) mechanism (e.g. on the kernel
command line) to set it up to be accepted for root logins, which avoids
needing to do this setup in cloud-init, autopkgtest's setup-testbed,
or similar.
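
Something like this, for example (a sketch: check
systemd.system-credentials(7) for the exact credential name and the
minimum systemd version; the key is base64-encoded because kernel
command line values cannot contain spaces):

    # appended to the VM's kernel command line by the host; assumes a
    # systemd new enough to consume the ssh.authorized_keys.root credential
    systemd.set_credential_binary=ssh.authorized_keys.root:$(base64 -w0 ~/.ssh/id_ed25519.pub)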

For "application" containers like the ones you would presumably want
to be using for sbuild, presumably something non-ssh is desirable.

> I am not sure whether podman exec operates in this way, but a quick
> codesearch did not exhibit obvious uses of setns inside the podman
> source code.  Would anyone be able to tell how podman exec is
> implemented here?

I don't know the answer to this.

> I think you really need one more non-trivial (but very commonly
> available) privilege. You need a cgroup manager (such as systemd) that
> allows creating and delegating a cgroup hierarchy to you.

Quite possibly, yes. I don't think I ever tried running
autopkgtest-virt-podman --init on a system that didn't have
systemd-as-pid-1 and a working `systemd --user`.
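
A quick way to check that part of the setup (a sketch, assuming systemd)
is to ask whether your user manager has been delegated a cgroup subtree,
and which controllers came with it:

    # Delegate=yes plus a non-empty controller list is what rootless
    # container managers want to see
    $ systemctl show -p Delegate,DelegateControllers user@$(id -u).service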

> Would someone be able to document (mail/wiki/blog/...) how to set up and
> use podman for running autopkgtests? Thus far, I failed to figure out
> how to plug a local Debian mirror (as opposed to a container registry)
> into autopkgtest-build-podman. It is quite difficult to locate podman
> documentation that is applicable under the assumption that you don't
> want to use any container registry.

If you build an image by importing a tarball that you have built in
whatever way you prefer, minimally something like this:

    $ cat > Dockerfile <<EOF
    FROM scratch
    ADD minbase.tar.gz /
    EOF
    $ podman build -f Dockerfile -t local-debian:sid .

then you should be able to use localhost/local-debian:sid
as a substitute for debian:sid in the examples given in
autopkgtest-virt-podman(1), either using it as-is for testing:

    $ autopkgtest -U hello*.dsc -- podman localhost/local-debian:sid

or making an image that has been pre-prepared with some essentials like
dpkg-source, and testing in that:

    $ autopkgtest-build-podman --image localhost/local-debian:sid
    ...
    Successfully tagged localhost/autopkgtest/localhost/local-debian:sid
    $ autopkgtest hello*.dsc -- podman autopkgtest/localhost/local-debian:sid
    (tests run)

Adding a mode for "start from this pre-prepared minbase tarball" to all
of the autopkgtest-build-* tools (so that they don't all need to know
how to run debootstrap/mmdebstrap from first principles, and then
duplicate the necessary options to make it do the right thing) has been
on my to-do list for literally years. Maybe one day I will get there.

We could certainly also benefit from some syntactic sugar to make the
automatic choice of an image name for localhost/* podman images nicer,
with fewer repetitions of localhost/.

By default (as per /etc/containers/registries.conf.d/shortnames.conf),
podman considers debian:sid to be short for docker.io/library/debian:sid,
which is the closest thing we have to "official" Debian OCI images. If we
had our own self-hosted container registry with suitable scalability and
security, like Red Hat and SUSE do, that file could point there instead.
Salsa does in fact provide us with a self-hosted container registry,
but probably not one that is sufficiently scalable?
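
That would only take a drop-in like this (a sketch; the registry
hostname is made up, since no such official registry currently exists):

    # /etc/containers/registries.conf.d/99-debian.conf
    [aliases]
      "debian" = "registry.example.debian.org/debian"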

podman is unlikely to provide you with a way to generate a minbase
tarball without first creating or downloading some sort of container
image in which you can run debootstrap or mmdebstrap, because you have
to be able to start from somewhere. But you can run mmdebstrap unprivileged
in unshare mode, so that's enough to get you that starting point.
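
For example (a sketch; the mirror URL is a placeholder for whatever
local mirror or apt cache you actually have):

    # build a minbase tarball unprivileged, from a local mirror,
    # with no container tooling involved
    $ mmdebstrap --mode=unshare --variant=minbase \
          sid minbase.tar.gz http://localhost/debian

This also answers the local-mirror half of the question: the mirror goes
into mmdebstrap, and podman only ever sees the resulting tarball.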

> We learned that sbuild --chroot-mode=unshare and unschroot spawn
> a new set of namespaces for every command. What you point out as a
> limitation also is a feature. Technically, it is a lie that the
> namespaces are always constructed in the same way. During installation
> of build depends the network namespace is not unshared while package
> builds commonly use an unshared network namespace with no interfaces but
> the loopback interface.

I don't think podman can do this within a single run. It might be feasible
to do the setup (installing build-dependencies) with networking enabled;
leave the root filesystem of that container intact; and reuse it as the
root filesystem of the container in which the actual build runs, this time
with --network=none?
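
podman commit is one way to approximate that (an untested sketch; hello
is just a stand-in, and the image is assumed to have deb-src entries and
build-essential so that build-dep and dpkg-buildpackage work):

    # do the network-using setup in one container, snapshot it as a
    # temporary image, then build offline in a second container
    $ podman run --name setup localhost/local-debian:sid \
          sh -c 'apt-get update && apt-get -y build-dep hello'
    $ podman commit setup localhost/local-debian:sid-hello-deps
    $ podman rm setup
    $ podman run --rm --network=none -v "$PWD:/src" -w /src \
          localhost/local-debian:sid-hello-deps \
          sh -c 'dpkg-source -x hello_*.dsc t && cd t && dpkg-buildpackage -us -uc'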

Or the "install build-dependencies" step (and other setup) could perhaps
even be represented as a `podman build` (with a Dockerfile/Containerfile,
FROM the image you had as your starting point), outputting a temporary
container image, in which the actual dpkg-buildpackage step can be invoked
by `podman run --network=none --rmi`?
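
Roughly like this, perhaps (again an untested sketch with hello as the
stand-in; --rmi discards the temporary image when the container exits):

    # the same setup step, expressed as an image build
    $ cat > Containerfile <<EOF
    FROM localhost/local-debian:sid
    RUN apt-get update && apt-get -y build-dep hello
    EOF
    $ podman build -f Containerfile -t local-debian:sid-hello-deps .
    $ podman run --rm --rmi --network=none -v "$PWD:/src" -w /src \
          localhost/local-debian:sid-hello-deps \
          sh -c 'dpkg-source -x hello_*.dsc t && cd t && dpkg-buildpackage -us -uc'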

    smcv
