On Fri, 13 Sep 2024 at 11:15:55 +0200, Helmut Grohne wrote:
> My initial experiments indicate that we're in
> for a factor two [slowdown] whereas we could get this down significantly
> by using an overlayfs approach that we cannot shoehorn into podman.

Er, podman does use overlayfs, in at least some circumstances?

    $ podman run --rm -it debian:sid-slim grep ' / / ' /proc/self/mountinfo
    464 131 0:101 / / rw,relatime - overlay overlay rw,lowerdir=/home/smcv/.local/share/containers/storage/overlay/l/[…],upperdir=/home/smcv/.local/share/containers/storage/overlay/[…]/diff,workdir=/home/smcv/.local/share/containers/storage/overlay/[…]/work,redirect_dir=nofollow,uuid=on,volatile,userxattr

In unstable (and I think also bookworm, but I haven't checked recently),
/usr/share/containers/storage.conf defaults to the "overlay" driver -
but the real default is whatever already exists in
~/.local/share/containers/storage, with the configured driver only used
for new setups, unless forced.

I think the performance characteristics you describe probably mean that
you have container storage that is already using the "vfs" driver, which
is indeed based on quite a lot of copying.

> podman
> upstream insists on CAP_SYS_ADMIN being a no go while systemd upstream
> insists on CAP_SYS_ADMIN being a requirement

Sorry, this is just not true, in either direction.

podman can be configured to allow CAP_SYS_ADMIN inside the container
(podman run --cap-add=CAP_SYS_ADMIN), but it isn't the default, because
it likely[1] means that "containers don't contain" (no effective
security boundary between root in the container, and the user whose uid
was mapped to the container's uid 0). I suspect the same is going to be
equally true for anything that retains CAP_SYS_ADMIN and maps your real
uid to a container uid, but having a uid in common is usually desirable
if you want to be able to provide files to the container, or provide a
place where the container can write files back out.

systemd doesn't "insist on" CAP_SYS_ADMIN either - it specifically
doesn't require it! - but some individual systemd features do require it.
At the moment, it will fail closed (services like polkitd whose
security-hardening settings need CAP_SYS_ADMIN fail to start), which
surprised me, because other systemd security-hardening settings tend to
fail open (if systemd doesn't have all of the necessary capabilities or
kernel features then the service still starts, but the rest of the
containerized system is less protected from the service than it could
have been).

[1] I asked podman upstream and the answer can be summarized as
    "it's complicated, but probably"

> I have reached the
> conclusion that doing a persistent namespace requires a background
> process and an IPC mechanism. (This requirement rules out
> podman/docker/crun/runc.)

podman/docker can certainly run a background process that accepts
commands via IPC. They don't do this by default, sure, but if you make
the container payload include a process that accepts commands - perhaps
on an AF_UNIX or TCP socket, or through pipes - then they won't stand in
the way of doing that.

(Proof of concept 1: a podman container with an init system and sshd.
Proof of concept 2: the persistent process is a shell inside the
container, and the IPC mechanism is a pipe on stdin and another pipe on
stdout. Obviously an interactive shell makes a really bad IPC protocol,
as we already knew from autopkgtest-virt-qemu and LAVA, and for
production use it would be better to use a more structured protocol with
proper framing and error handling, like the D-Bus interface that
systemd-run uses - but that's an implementation detail.)

    smcv
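
P.S. For illustration, proof of concept 2 can be sketched in a few lines
of Python. This is only a rough sketch under assumptions, not a tested
podman integration: the class name PersistentShell, the random end-of-
command marker and the argv shown in the comment are all made up for this
example. It keeps one shell running, writes each command to the shell's
stdin, and reads stdout until the marker line appears.

```python
import subprocess
import uuid


class PersistentShell:
    """Keep one shell alive and feed it commands over a pair of pipes."""

    def __init__(self, argv):
        # argv starts the long-lived process. For a container this might be
        # something like (assumption, not exercised here):
        #   ["podman", "run", "--rm", "-i", "debian:sid-slim", "sh"]
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,
        )

    def run(self, command):
        """Run one command; return (exit_status, stdout_text)."""
        # The only framing a bare shell offers: ask it to print a random
        # marker plus the exit status once the command has finished.
        marker = "END-" + uuid.uuid4().hex
        self.proc.stdin.write(f"{command}\nprintf '\\n%s %d\\n' {marker} $?\n")
        self.proc.stdin.flush()
        lines = []
        for line in self.proc.stdout:
            if line.startswith(marker + " "):
                status = int(line.split()[1])
                output = "".join(lines)
                # Drop the newline that printf emitted before the marker.
                if output.endswith("\n"):
                    output = output[:-1]
                return status, output
            lines.append(line)
        raise RuntimeError("shell exited before the command completed")

    def close(self):
        """Close stdin so the shell sees EOF, then reap it."""
        self.proc.stdin.close()
        status = self.proc.wait()
        self.proc.stdout.close()
        return status
```

State persists between calls because it is the same shell process each
time, so a variable set by one run() is visible to the next. The
weaknesses are the ones already mentioned: stderr is not captured, a
command that reads stdin will swallow the next command, and a command
that exits the shell takes the whole channel down with it.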