On Sun, 05 Feb 2023 at 21:23:18 +0100, Helmut Grohne wrote:
> What is needed to make this work? mmdebstrap --mode=unshare requires the
> following features:
>  * unprivileged unsharing of user namespaces
>     - This is prohibited on DSA machines via a sysctl
>     - It works on most other systems
>     - Test case: unshare -U true
>  * A subuid allocation in /etc/subuid
>     - Allocated by default during user creation
>     - Test case: grep -q ^$(id -un): /etc/subuid

To make this genuinely useful for typical container use-cases, you usually
want a block of 65536 uids; otherwise you'll tend to get weird failures.
User creation allocates a suitable block by default.

>  * subuid allocation must be mapped by container technology (if any)
>     - I suppose the unshare backend fails this. Likely also unprivileged
>       podmand.
>  * It must be possible to mount proc in the unshared user+mount+pid
>    namespace.
>     - This should always work but may be restricted by the container
>       technology for some reason.

Container technologies that don't unshare the user namespace (uid 0 on the
host = uid 0 in the container), notably Docker and older privileged setups
for lxc, have no userns-based protection from misuse of things like
/proc/sysrq-trigger; so they have to either prevent those attacks some
other way, which tends to involve disallowing mounting /proc, or accept
that root in the container gives you root in real life (leading to the
mantra "containers don't contain").

>     - Test case: unshare -U -m -p -f -r --mount-proc true
>     - Paul tried this in the operational lxc containers. Successfully.
>     - I tried this in a local autopkgtest-unstable lxc container.
>       Successfully (unprivileged).
>     - Johannes reported that this would be the step that fails.
>  * Maybe more, but I don't know what would be missing. Maybe Johannes
>    knows.

A somewhat weaker form of user namespace support could be summarized as
"enough to run bubblewrap", which requires unprivileged unsharing of user
namespaces and the ability to mount a new instance of /proc, but does not
require a subuid allocation or any setuid helpers like newuidmap/newgidmap.
This would be enough for single-user single-app sandboxes like bubblewrap,
Flatpak, epiphany-browser (GNOME Web), and probably Chromium's sandboxing
(which doesn't use bubblewrap, but makes similar syscalls itself), but not
enough for containers that want multiple uids, such as podman and
mmdebstrap/unshare.

    smcv
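
PS: The quoted grep test case only checks that a subuid entry exists, not
how large it is. As a rough sketch (not something mmdebstrap or podman
ships), a stricter check for a block of at least 65536 ids could look like
the following. The function name has_subid_block and its arguments are
made up for illustration, and it assumes the usual name:start:count line
format of /etc/subuid.

```sh
#!/bin/sh
# Hypothetical helper, not part of any existing tool.
# Usage: has_subid_block USER [FILE] [MIN]
# Succeeds if FILE (default /etc/subuid) has at least one entry for USER
# whose count field is >= MIN (default 65536).
# Assumes lines of the form "name:start:count".
has_subid_block() {
    user=$1
    file=${2:-/etc/subuid}
    min=${3:-65536}
    awk -F: -v user="$user" -v min="$min" '
        # Compare numerically: "+ 0" forces both sides to numbers.
        $1 == user && $3 + 0 >= min + 0 { found = 1 }
        END { exit !found }
    ' "$file"
}
```

For example, has_subid_block "$(id -un)" could be run alongside the
"unshare -U true" and "unshare -U -m -p -f -r --mount-proc true" test cases
from the list above; the weaker "enough to run bubblewrap" level would only
need the two unshare checks.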