On Fri, Apr 04, 2025 at 02:05:32PM +0100, Daniel P. Berrangé wrote:
> On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote:
> > On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote:
> > > It occurred to me that the problem we face with the CID space usage
> > > is somewhat similar to the UID/GID space usage for user namespaces.
> > >
> > > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map,
> > > to allow IDs in the namespace to be arbitrarily mapped onto IDs in
> > > the host.
> > >
> > > At the risk of being overkill, is it worth trying a similar kind of
> > > approach for the vsock CID space ?
> > >
> > > A simple variant would be a /proc/net/vsock_cid_outside specifying
> > > a set of CIDs which are exclusively referencing /dev/vhost-vsock
> > > associations created outside the namespace. Anything not listed
> > > would be exclusively referencing associations created inside the
> > > namespace.
> > >
> > > A more complex variant would be to allow a full remapping of CIDs
> > > as is done with userns, via a /proc/net/vsock_cid_map with the same
> > > three parameters, so that a CID=15 association outside the
> > > namespace could be remapped to CID=9015 inside the namespace,
> > > allowing the inside namespace to define its own association for
> > > CID=15 without clashing.
> > >
> > > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock
> > > associations created outside the namespace, while unmapped CIDs
> > > would be exclusively referencing /dev/vhost-vsock associations
> > > inside the namespace.
> > >
> > > A likely benefit of relying on a kernel defined mapping/partition
> > > of the CID space is that apps like QEMU don't need changing, as
> > > there's no need to invent a new /dev/vhost-vsock-netns device node.
> > >
> > > Both approaches give the desirable security protection whereby the
> > > inside namespace can be prevented from accessing certain CIDs that
> > > were associated outside the namespace.
> > >
> > > Some rule would need to be defined for updating the
> > > /proc/net/vsock_cid_map file, as it is the security control
> > > mechanism. If it is write-once, then once the container mgmt app
> > > initializes it, nothing later could change it.
> > >
> > > A key question is: do we need the "first come, first served"
> > > behaviour for CIDs, where a CID can be arbitrarily used by the
> > > outside or inside namespace according to whatever tries to
> > > associate that CID first ?
> >
> > I think with /proc/net/vsock_cid_outside, instead of disallowing the
> > CID from being used, this could be solved by disallowing remapping
> > the CID while it is in use?
> >
> > The thing I like about this is that users can check
> > /proc/net/vsock_cid_outside to figure out what might be going on,
> > instead of trying to check lsof or ps to figure out whether the VMM
> > processes have used /dev/vhost-vsock vs /dev/vhost-vsock-netns.
> >
> > Just to check I am following... I suppose we would have a few
> > typical configurations for /proc/net/vsock_cid_outside, following
> > the uid_map file format of:
> >
> >   "<local cid start> <global cid start> <range size>"
> >
> > 1. Identity mapping, current namespace CID is global CID (default
> >    setting for new namespaces):
> >
> >      # empty file
> >
> >      OR
> >
> >      0 0 4294967295
> >
> > 2. Complete isolation from the global space (initialized, but no
> >    mappings):
> >
> >      0 0 0
> >
> > 3. Mapping in ranges of global CIDs
> >
> >    For example, global CID space starts at 7000, up to the 32-bit
> >    max:
> >
> >      7000 0 4294960295
> >
> >    Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map
> >    to 8000-8100):
> >
> >      7000 0 100
> >      8000 1000 100
> >
> > One thing I don't love is that option 3 seems not to be addressing a
> > known use case. It doesn't necessarily hurt to have, but it will add
> > complexity to CID handling that might never get used?
>
> Yeah, I have the same feeling that full remapping of CIDs is probably
> adding complexity without clear benefit, unless it somehow helps us
> with the nested-virt scenario to disambiguate L0/L1/L2 CID ranges ?
> I've not thought the latter through to any great level of detail
> though.
>
> > Since options 1/2 could also be represented by a boolean (yes/no
> > "current ns shares CID with global"), I wonder if we could either
> > A) only support the first two options at first, or B) add just
> > /proc/net/vsock_ns_mode at first, which supports only "global" and
> > "local", and later add a "mapped" mode plus
> > /proc/net/vsock_cid_outside or the full mapping if the need arises?
>
> Two options is sufficient if you want to control AF_VSOCK usage and
> /dev/vhost-vsock usage as a pair. If you want to control them
> separately though, it would push for three options - global, local,
> and mixed. By mixed I mean AF_VSOCK in the NS can access the global
> CID from the NS, but the NS can't associate the global CID with a
> guest.
>
> IOW, this breaks down like:
>
> * CID=N local - aka fully private
>
>   Outside NS: Can associate outside CID=N with a guest.
>               AF_VSOCK permitted to access outside CID=N
>
>   Inside NS:  Can NOT associate outside CID=N with a guest
>               Can associate inside CID=N with a guest
>               AF_VSOCK forbidden to access outside CID=N
>               AF_VSOCK permitted to access inside CID=N
>
> * CID=N mixed - aka partially shared
>
>   Outside NS: Can associate outside CID=N with a guest.
>               AF_VSOCK permitted to access outside CID=N
>
>   Inside NS:  Can NOT associate outside CID=N with a guest
>               AF_VSOCK permitted to access outside CID=N
>               No inside CID=N concept
>
> * CID=N global - aka current historic behaviour
>
>   Outside NS: Can associate outside CID=N with a guest.
>               AF_VSOCK permitted to access outside CID=N
>
>   Inside NS:  Can associate outside CID=N with a guest
>               AF_VSOCK permitted to access outside CID=N
>               No inside CID=N concept
>
> I was thinking the 'mixed' mode might be useful if the outside NS
> wants to retain control over setting up the association, but delegate
> to processes in the inside NS for providing individual services to
> that guest. This means that if the outside NS needs to restart the
> VM, there is no race window in which the inside NS can grab the
> association with the CID.
>
> As for whether we need to control this per-CID, or have a single
> setting applying to all CIDs:
>
> Consider that the host OS can be running one or more "service VMs" on
> well known CIDs that can be leveraged from other NS, while those
> other NS also run some "end user VMs" that should be private to the
> NS.
>
> IOW, the CIDs for the service VMs would need to be using the "mixed"
> policy, while the CIDs for the end user VMs would be "local".
>
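Just to double-check that I'm reading your local/mixed/global table
correctly, here is how I'd boil the rules down. This is a rough
user-space sketch of my understanding only; the enum and helper names
below are made up for illustration, not a proposed interface:

#include <stdbool.h>

enum vsock_cid_mode {
        VSOCK_CID_LOCAL,        /* fully private to the inside NS */
        VSOCK_CID_MIXED,        /* inside NS may connect to it, but not claim it */
        VSOCK_CID_GLOBAL,       /* current historic behaviour, fully shared */
};

/* Can a VMM inside the NS associate the outside CID=N with a guest? */
static bool inside_may_associate_outside_cid(enum vsock_cid_mode mode)
{
        return mode == VSOCK_CID_GLOBAL;
}

/* Can AF_VSOCK sockets inside the NS reach the outside CID=N? */
static bool inside_may_connect_outside_cid(enum vsock_cid_mode mode)
{
        return mode != VSOCK_CID_LOCAL; /* mixed and global both allow it */
}

/* Does the NS get its own, separate inside CID=N to associate and use? */
static bool inside_has_own_cid(enum vsock_cid_mode mode)
{
        return mode == VSOCK_CID_LOCAL; /* mixed/global: no inside CID=N */
}

If that matches what you meant, then the per-CID vs single-setting
question is really just whether the lookup that produces the mode value
is keyed on the CID or on the namespace alone.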
I think this sounds pretty flexible, and IMO adding the third mode
doesn't add much additional complexity. Going this route, we have:

- three modes: local, global, mixed
- at first, no vsock_cid_map (local has no outside CIDs, global and
  mixed have no inside CIDs, so no cross-mapping is needed)
- only later, a full mapped mode plus vsock_cid_map, if it ever turns
  out to be necessary

Stefano, any preferences on this vs starting with the restricted
vsock_cid_map (only supporting "0 0 0" and "0 0 <size>")? I'm leaning
towards the modes because they cover more use cases and seem like a
clearer user interface.

To clarify another aspect... child namespaces must inherit the
parent's local space. So if namespace P sets its mode to local, and
then creates a child process that in turn creates namespace C, then
C's global and mixed modes are implicitly restricted to P's local
space?

Thanks,
Bobby
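P.S. In case it helps the comparison with uid_map: this is roughly how
I picture a vsock_cid_map lookup working if we do end up adding it
later. It's a user-space sketch of the idea only, all names are made
up, and I may well have the column order of my example entries above
backwards:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of the proposed "<local cid start> <global cid start> <range size>" file */
struct cid_map_entry {
        uint32_t local_first;   /* first CID as seen inside the namespace */
        uint32_t global_first;  /* first CID as seen outside (global) */
        uint32_t count;         /* number of CIDs in the mapped range */
};

/* e.g. local 0-99 <-> global 7000-7099, local 1000-1099 <-> global 8000-8099 */
static const struct cid_map_entry map[] = {
        { .local_first = 0,    .global_first = 7000, .count = 100 },
        { .local_first = 1000, .global_first = 8000, .count = 100 },
};

/* Translate a namespace-local CID to its global CID; false if unmapped. */
static bool local_to_global_cid(uint32_t local, uint32_t *global)
{
        for (size_t i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
                if (local >= map[i].local_first &&
                    local - map[i].local_first < map[i].count) {
                        *global = map[i].global_first +
                                  (local - map[i].local_first);
                        return true;    /* refers to an outside association */
                }
        }
        return false;   /* unmapped: the CID stays private to this namespace */
}

A mapped CID would resolve to an outside association, and anything
unmapped would stay local to the namespace, matching the partitioning
you described.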