Chris Webb <ch...@arachsys.com> writes:

> Prompted by the new userns support merged in the 3.8/3.9 kernels, I've been
> playing with namespaces and trying to understand how I could use them to
> build containers to replace some of my uses of qemu-kvm virtual machines.
>
> I've successfully created a fakeroot-type container running as an
> unprivileged user by unsharing everything including CLONE_NEWUSER, and can
> map a block of host UIDs for that environment by writing to
> /proc/PID/[ug]id_map from a helper process running as root.
>
> However, what I'm hoping for in practice is to be able to create containers
> whose access to its filesystem subtree is untranslated, i.e. uid/gid N in
> the container maps to uid/gid N in a subdirectory of the filesystem, but
> which is still isolated from the rest of the host filesystem and can't do
> externally privileged things. This is pretty much what a BSD jail provides,
> for example.
>
> Is this possible to achieve securely using the mechanisms now available?
> (I'm assuming that parent directory permissions prevent unprivileged host
> users from getting at these container filesystems, exactly as is necessary
> to make BSD jails safe.)
>
>
> As a first step, I naively tried running as root and unsharing everything
> with
>
>   unshare(CLONE_NEWIPC | CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWPID
>           | CLONE_NEWUTS | CLONE_NEWUSER);
>
> before execing a shell[1]. From another root process in the host namespace,
> I then wrote a pass-through mapping 0 0 4294967295 to /proc/PID/[ug]id_map.
That will work, but you really don't want to run with uid == 0 mapped to
uid == 0. There are too many things in /proc and /sys and similar that
grant access to uid == 0.

> The result initially looks plausible, with the PID namespace preventing
> signals being sent from one container to another, despite those processes
> sharing the same user ID in the top-level user namespace.
>
> However, unfortunately I still have too many privileges with respect to the
> host. Whilst (for example) I can't mknod, I can mount a sysfs or procfs and
> apparently write to them with host root privileges to reconfigure the host
> kernel. I suspect there will be other things I haven't secured by this
> recipe too.

Yes. I recommend having a dedicated range of uids for your container to
prevent this kind of silliness. Or at the very least a separate mapping
of uid == 0.

> I also tried tightening things up by dropping capabilities from my root user
> and preventing capability grant on exec by setting and locking SECBIT_NOROOT
> on before starting the container. However, I'm not sure this really makes
> any difference---does CLONE_NEWUSER drop all capabilities with respect to
> the parent namespace?

Yes. CLONE_NEWUSER drops all capabilities with respect to the parent
namespace.

> [1] In this description, I'm ignoring the part where I lock into a new root
> filesystem, but presumably the way to do this is by pivot_root into a bind
> mount?

Yes, pivot_root and a bind mount work.

Eric
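As a rough illustration of the dedicated uid range suggested above, here is a
minimal sketch of what the root helper in the host namespace might write to
/proc/PID/uid_map and /proc/PID/gid_map instead of the pass-through mapping.
The base of 100000 and the count of 65536 are assumed values picked only for
the example, and format_id_map() and write_id_map() are hypothetical helper
names, not something from the kernel or from the original recipe.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed values: a host id range reserved for a single container. */
#define CONTAINER_ID_BASE  100000u
#define CONTAINER_ID_COUNT 65536u

/*
 * Format one map line: "<first id inside ns> <first id outside ns> <count>\n".
 * Returns 0 on success, -1 if the buffer is too small.
 */
int format_id_map(char *buf, size_t len,
                  unsigned inside, unsigned outside, unsigned count)
{
    assert(buf != NULL);
    int n = snprintf(buf, len, "%u %u %u\n", inside, outside, count);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write the mapping to /proc/<pid>/<file>, where file is "uid_map" or
 * "gid_map". A map file can be written only once per user namespace, and
 * the whole mapping has to go in with a single write().
 * Returns 0 on success, -1 on any failure.
 */
int write_id_map(pid_t pid, const char *file,
                 unsigned inside, unsigned outside, unsigned count)
{
    char path[64], line[64];

    int n = snprintf(path, sizeof path, "/proc/%ld/%s", (long)pid, file);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;
    if (format_id_map(line, sizeof line, inside, outside, count) < 0)
        return -1;

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t want = (ssize_t)strlen(line);
    ssize_t got = write(fd, line, (size_t)want);
    close(fd);
    return got == want ? 0 : -1;
}
```

Once the child has called unshare(CLONE_NEWUSER), the helper would call
write_id_map(child_pid, "uid_map", 0, CONTAINER_ID_BASE, CONTAINER_ID_COUNT)
and then the same for "gid_map". With such a mapping, uid 0 inside the
container is host uid 100000 rather than host uid 0, so files in /proc and
/sys that grant access to host uid 0 should no longer be writable from inside.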