On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote:
> On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote:
> > On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote:
> > > On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshle...@gmail.com>
> > > wrote:
> > > >
> > > > I do like Stefano's suggestion to add a sysctl for a "strict" mode,
> > > > since it offers the best of both worlds, and still tends conservative in
> > > > protecting existing applications... but I agree, the non-strict mode
> > > > vsock would be unique WRT the usual concept of namespaces.
> > >
> > > Maybe we could do the opposite, enable strict mode by default (I think
> > > it was similar to what I had tried to do with the kernel module in v1, I
> > > was young I know xD)
> > > And provide a way to disable it for those use cases where the user wants
> > > backward compatibility, while paying the cost of less isolation.
> >
> > I think backwards compatible has to be the default behaviour, otherwise
> > the change has too high risk of breaking existing deployments that are
> > already using netns and relying on VSOCK being global. Breakage has to
> > be opt in.
> >
> > > I was thinking two options (not sure if the second one can be done):
> > >
> > > 1. provide a global sysfs/sysctl that disables strict mode, but this
> > > then applies to all namespaces
> > >
> > > 2. provide something that allows disabling strict mode by namespace.
> > > Maybe when it is created there are options, or something that can be
> > > set later.
> > >
> > > 2 would be ideal, but that might be too much, so 1 might be enough. In
> > > any case, 2 could also be a next step.
> > >
> > > WDYT?
> >
> > It occurred to me that the problem we face with the CID space usage is
> > somewhat similar to the UID/GID space usage for user namespaces.
> >
> > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to
> > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host.
> >
> > At the risk of being overkill, is it worth trying a similar kind of
> > approach for the vsock CID space ?
> >
> > A simple variant would be a /proc/net/vsock_cid_outside specifying a set
> > of CIDs which are exclusively referencing /dev/vhost-vsock associations
> > created outside the namespace. Anything not listed would be exclusively
> > referencing associations created inside the namespace.
> >
> > A more complex variant would be to allow a full remapping of CIDs as is
> > done with userns, via a /proc/net/vsock_cid_map, with the same three
> > parameters, so that CID=15 association outside the namespace could be
> > remapped to CID=9015 inside the namespace, allowing the inside namespace
> > to define its own association for CID=15 without clashing.
> >
> > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock
> > associations created outside namespace, while unmapped CIDs would be
> > exclusively referencing /dev/vhost-vsock associations inside the
> > namespace.
> >
> > A likely benefit of relying on a kernel defined mapping/partition of
> > the CID space is that apps like QEMU don't need changing, as there's
> > no need to invent a new /dev/vhost-vsock-netns device node.
> >
> > Both approaches give the desirable security protection whereby the
> > inside namespace can be prevented from accessing certain CIDs that
> > were associated outside the namespace.
> >
> > Some rule would need to be defined for updating the /proc/net/vsock_cid_map
> > file as it is the security control mechanism. If it is write-once then
> > if the container mgmt app initializes it, nothing later could change
> > it.
> >
> > A key question is do we need the "first come, first served" behaviour
> > for CIDs where a CID can be arbitrarily used by outside or inside namespace
> > according to whatever tries to associate a CID first ?
>
> I think with /proc/net/vsock_cid_outside, instead of disallowing the CID
> from being used, this could be solved by disallowing remapping the CID
> while in use?
>
> The thing I like about this is that users can check
> /proc/net/vsock_cid_outside to figure out what might be going on,
> instead of trying to check lsof or ps to figure out if the VMM processes
> have used /dev/vhost-vsock vs /dev/vhost-vsock-netns.
>
> Just to check I am following... I suppose we would have a few typical
> configurations for /proc/net/vsock_cid_outside. Following uid_map file
> format of:
>     "<local cid start>    <global cid start>    <range size>"
>
> 1. Identity mapping, current namespace CID is global CID (default
> setting for new namespaces):
>
>     # empty file
>
>             OR
>
>     0    0    4294967295
>
> 2. Complete isolation from global space (initialized, but no mappings):
>
>     0    0    0
>
> 3. Mapping in ranges of global CIDs
>
>     For example, global CID space starts at 7000, up to 32-bit max:
>
>     7000    0    4294960295
>
>     Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to
>     8000-8100) :
>
>     7000    0       100
>     8000    1000    100
>
>
> One thing I don't love is that option 3 seems to not be addressing a
> known use case. It doesn't necessarily hurt to have, but it will add
> complexity to CID handling that might never get used?
>
> Since options 1/2 could also be represented by a boolean (yes/no
> "current ns shares CID with global"), I wonder if we could either A)
> only support the first two options at first, or B) add just
> /proc/net/vsock_ns_mode at first, which supports only "global" and
> "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside
> or the full mapping if the need arises?
>
> This could also be how we support Option 2 from Stefano's last email of
> supporting per-namespace opt-in/opt-out.
>
> Any thoughts on this?
>
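
To make the uid_map-style format concrete, here is a rough Python sketch
of how such a file might be interpreted. It is only an illustration, not
kernel code: /proc/net/vsock_cid_outside does not exist today, and
parse_cid_map() and to_global() are made-up names. It assumes the three
fields are read in the order stated in the format line (local start,
global start, range size), that an empty file means identity, and that a
CID with no mapping stays local to the namespace.

```python
from typing import List, Optional, Tuple

CID_MAX = 0xFFFFFFFF  # CIDs are 32-bit values

# (local cid start, global cid start, range size), as in uid_map
Extent = Tuple[int, int, int]


def parse_cid_map(text: str) -> Optional[List[Extent]]:
    """Parse the hypothetical map file.

    Returns None for an empty file (assumed to mean identity mapping),
    otherwise the list of extents, which may map nothing ("0 0 0").
    """
    extents: List[Extent] = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got: {line!r}")
        local, glob, size = (int(f) for f in fields)
        if min(local, glob, size) < 0:
            raise ValueError(f"negative value in: {line!r}")
        if local + size > CID_MAX + 1 or glob + size > CID_MAX + 1:
            raise ValueError(f"range exceeds 32-bit CID space: {line!r}")
        extents.append((local, glob, size))
    return extents or None


def to_global(extents: Optional[List[Extent]], cid: int) -> Optional[int]:
    """Translate a namespace-local CID to a global CID.

    Returns None when the CID is unmapped, i.e. it would refer only to
    associations created inside the namespace.
    """
    if extents is None:
        return cid  # identity: namespace CID is the global CID
    for local, glob, size in extents:
        if local <= cid < local + size:
            return glob + (cid - local)
    return None
```

With this reading, an empty file passes every CID through unchanged,
"0 0 0" maps nothing (complete isolation), and a file such as
"0 7000 100" would make local CIDs 0-99 refer to global CIDs 7000-7099.
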

Stefano,

Would only supporting 1/2 still support the Kata use case?

Thanks,
Bobby