Unprivileged containers cannot run mknod, making it difficult to support devices that appear at runtime. Using devtmpfs is one possible solution, and it would have the added benefit of making container setup simpler. But simply letting containers mount devtmpfs isn't sufficient, since the container may need to see a different, more limited set of devices, and because different environments making modifications to the filesystem could lead to conflicts.

This series solves these problems by assigning devices to user namespaces. Each device has an "owner" namespace, which determines which devtmpfs mount the device appears in and also allows privileged operations on the device from that namespace. The owner defaults to init_user_ns. There's also an ns_global flag to indicate that a device should appear in all devtmpfs mounts.

devtmpfs is updated to present a different superblock to each user namespace. Each superblock contains nodes only for global devices and for the devices assigned to the associated namespace.

The implementation isn't complete at this point: it's lacking proper cleanup when a namespace is no longer in use, and only a sampling of devices has been updated to support use in namespaces. I'm sending the patches now for feedback on the overall approach and the implementation so far. I also have a couple of areas where I'd appreciate some suggestions:

 * If devices are owned by a namespace it might be useful to have this awareness for uevents and sysfs as well. Would it make sense to apply the ownership to kobjects rather than devices?

 * I'd like to be able to clean up when a namespace is destroyed, e.g. with loop devices I'd probably free up any devices owned by the namespace. But that's impossible in the current implementation, since the device holds a reference to the namespace and so the namespace is never destroyed while the device exists. Any suggestions to get around this? I haven't spent much time thinking about it yet, but my first thought was to add some kind of weak reference to user namespaces. Then when the main reference count hits zero the namespace isn't destroyed, but there would be a notification that drivers could use to perform cleanup. Once all weak references were released the memory would actually be freed.
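
To make the weak reference idea a bit more concrete, a rough userspace model of the two counters might look like the following. This is only a sketch under stated assumptions: struct ns_model, struct dev_model and all of the ns_model_* helpers are hypothetical names and not existing kernel interfaces, and a real implementation would need atomic counters and locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model; NOT the kernel's struct user_namespace. */
struct ns_model {
	int strong;	/* main reference count */
	int weak;	/* device references, plus one held on behalf of all strong refs */
	bool dead;	/* set once the main count has hit zero */
	void (*notify)(struct ns_model *ns);	/* cleanup hook for drivers (assumed) */
	void (*release)(struct ns_model *ns);	/* actually frees the memory */
};

static void ns_model_free(struct ns_model *ns)
{
	free(ns);
}

static struct ns_model *ns_model_alloc(void (*notify)(struct ns_model *ns))
{
	struct ns_model *ns = calloc(1, sizeof(*ns));

	if (!ns)
		return NULL;
	ns->strong = 1;
	ns->weak = 1;	/* implicit weak ref owned by the strong refs */
	ns->notify = notify;
	ns->release = ns_model_free;
	return ns;
}

static void ns_model_get_weak(struct ns_model *ns)
{
	ns->weak++;
}

static void ns_model_put_weak(struct ns_model *ns)
{
	assert(ns->weak > 0);
	if (--ns->weak == 0)
		ns->release(ns);	/* last weak ref gone: free the memory */
}

static void ns_model_get(struct ns_model *ns)
{
	assert(!ns->dead);
	ns->strong++;
}

static void ns_model_put(struct ns_model *ns)
{
	assert(ns->strong > 0);
	if (--ns->strong == 0) {
		ns->dead = true;
		if (ns->notify)
			ns->notify(ns);	/* let drivers drop their devices */
		ns_model_put_weak(ns);	/* drop the implicit weak ref */
	}
}

/* Demonstration: one device holding a weak reference to its owner. */
struct dev_model {
	struct ns_model *owner;
};

static struct dev_model demo_dev;
static int notified, freed;

static void demo_notify(struct ns_model *ns)
{
	notified++;
	if (demo_dev.owner == ns) {
		demo_dev.owner = NULL;	/* "free" the device */
		ns_model_put_weak(ns);
	}
}

static void demo_release(struct ns_model *ns)
{
	freed++;
	free(ns);
}

/* Returns 0 if cleanup ran exactly once and the memory was then freed. */
static int run_demo(void)
{
	struct ns_model *ns = ns_model_alloc(demo_notify);

	if (!ns)
		return -1;
	ns->release = demo_release;
	demo_dev.owner = ns;
	ns_model_get_weak(ns);	/* device pins the memory, not the lifetime */

	ns_model_get(ns);
	ns_model_put(ns);	/* still one strong ref left: nothing happens */
	if (notified != 0)
		return 1;

	ns_model_put(ns);	/* last strong ref: notify, then free */
	return (notified == 1 && freed == 1 && !demo_dev.owner) ? 0 : 1;
}
```

The point of the extra weak reference taken at allocation time is that the memory cannot go away while the notification is still running, even if every device drops its weak reference from inside the callback.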

Thanks,
Seth

Seth Forshee (11):
  driver core: Assign owning user namespace to devices
  driver core: Add device_create_global()
  tmpfs: Add sub-filesystem data pointer to shmem_sb_info
  ramfs: Add sub-filesystem data pointer to ram_fs_info
  devtmpfs: Add support for mounting in user namespaces
  drivers/char/mem.c: Make null/zero/full/random/urandom available to user namespaces
  block: Make partitions inherit namespace from whole disk device
  block: Allow blkdev ioctls within user namespaces
  misc: Make loop-control available to all user namespaces
  loop: Assign devices to current_user_ns()
  loop: Allow privileged operations for root in the namespace which owns a device

 block/compat_ioctl.c       |   3 +-
 block/ioctl.c              |  16 +-
 block/partition-generic.c  |   2 +
 drivers/base/core.c        |  54 ++++-
 drivers/base/devtmpfs.c    | 509 ++++++++++++++++++++++++++++++++-------------
 drivers/block/loop.c       |  22 +-
 drivers/char/mem.c         |  28 ++-
 drivers/char/misc.c        |  11 +-
 fs/ramfs/inode.c           |   8 -
 include/linux/device.h     |  18 ++
 include/linux/miscdevice.h |   1 +
 include/linux/ramfs.h      |   9 +
 include/linux/shmem_fs.h   |   1 +
 13 files changed, 499 insertions(+), 183 deletions(-)