On Tue, Mar 03, 2026 at 09:47:26PM +0100, Alexander Graf wrote:
>
> On 03.03.26 15:17, Bryan Tan wrote:
> > On Tue, Mar 3, 2026 at 9:49 AM Stefano Garzarella <[email protected]> wrote:
> > > On Mon, Mar 02, 2026 at 08:04:22PM +0100, Alexander Graf wrote:
> > > > On 02.03.26 17:25, Stefano Garzarella wrote:
> > > > > On Mon, Mar 02, 2026 at 04:48:33PM +0100, Alexander Graf wrote:
> > > > > > On 02.03.26 13:06, Stefano Garzarella wrote:
> > > > > > > CCing Bryan, Vishnu, and Broadcom list.
> > > > > > >
> > > > > > > On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
> > > > > > > > Please target net-next tree for this new feature.
> > > > > > > >
> > > > > > > > On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
> > > > > > > > > Vsock maintains a single CID number space which can be used
> > > > > > > > > to communicate to the host (G2H) or to a child-VM (H2G).
> > > > > > > > > The current logic trivially assumes that G2H is only
> > > > > > > > > relevant for CID <= 2 because these target the hypervisor.
> > > > > > > > > However, in environments like Nitro Enclaves, an instance
> > > > > > > > > that hosts vhost_vsock powered VMs may still want to
> > > > > > > > > communicate to Enclaves that are reachable at higher CIDs
> > > > > > > > > through virtio-vsock-pci.
> > > > > > > > >
> > > > > > > > > That means that for CID > 2, we really want an overlay. By
> > > > > > > > > default, all CIDs are owned by the hypervisor. But if vhost
> > > > > > > > > registers a CID, it takes precedence. Implement that logic.
> > > > > > > > > Vhost already knows which CIDs it supports anyway.
> > > > > > > > >
> > > > > > > > > With this logic, I can run a Nitro Enclave as well as a
> > > > > > > > > nested VM with vhost-vsock support in parallel, with the
> > > > > > > > > parent instance able to communicate to both simultaneously.
> > > > > > > >
> > > > > > > > I honestly don't understand why VMADDR_FLAG_TO_HOST (added
> > > > > > > > specifically for Nitro IIRC) isn't enough for this scenario
> > > > > > > > and we have to add this change. Can you elaborate a bit more
> > > > > > > > about the relationship between this change and
> > > > > > > > VMADDR_FLAG_TO_HOST we added?
> > > > > >
> > > > > > The main problem I have with VMADDR_FLAG_TO_HOST for connect()
> > > > > > is that it punts the complexity to the user. Instead of a single
> > > > > > CID address space, you now effectively create 2 spaces: One for
> > > > > > TO_HOST (needs a flag) and one for TO_GUEST (no flag). But every
> > > > > > user space tool needs to learn about this flag. That may work
> > > > > > for super special-case applications. But propagating that all
> > > > > > the way into socat, iperf, etc etc? It's just creating friction.
> > > > >
> > > > > Okay, I would like to have this (or part of it) in the commit
> > > > > message to better explain why we want this change.
> > > > >
> > > > > > IMHO the most natural experience is to have a single CID space,
> > > > > > potentially manually segmented by launching VMs of one kind
> > > > > > within a certain range.
> > > > >
> > > > > I see, but at this point, should the kernel set
> > > > > VMADDR_FLAG_TO_HOST in the remote address if that path is taken
> > > > > "automagically"?
> > > > >
> > > > > So in that way the user space can have a way to understand if
> > > > > it's talking with a nested guest or a sibling guest.
> > > > >
> > > > > That said, I'm concerned about the scenario where an application
> > > > > does not even consider communicating with a sibling VM.
> > > >
> > > > If that's really a realistic concern, then we should add a
> > > > VMADDR_FLAG_TO_GUEST that the application can set. Default behavior
> > > > of an application that provides no flags is "route to whatever you
> > > > can find": If vhost is loaded, it routes to vhost. If a vsock backend
> > >
> > > mmm, we have always documented this simple behavior:
> > > - CID = 2 talks to the host
> > > - CID >= 3 talks to the guest
> > >
> > > Now we are changing this by adding a fallback. I don't think we
> > > should change the default behavior, but rather provide new ways to
> > > enable this new behavior.
> > >
> > > I find it strange that an application running on Linux 7.0 has a
> > > default behavior where using CID=42 always talks to a nested VM, but
> > > starting with Linux 7.1, it also starts talking to a sibling VM.
> > >
> > > > driver is loaded, it routes there. But the application has no say
> > > > in where it goes: It's purely a system configuration thing.
> > >
> > > This is true for complex things like IP, but for VSOCK we have always
> > > wanted to keep the default behavior very simple (as written above).
> > > Everything else must be explicitly enabled IMHO.
> > >
> > > > > Until now, it knew that by not setting that flag, it could only
> > > > > talk to nested VMs, so if there was no VM with that CID, the
> > > > > connection simply failed. Whereas from this patch onwards, if the
> > > > > device in the host supports sibling VMs and there is a VM with
> > > > > that CID, the application finds itself talking to a sibling VM
> > > > > instead of a nested one, without having any idea.
> > > >
> > > > I'd say an application that attempts to talk to a CID that it does
> > > > not know whether it's vhost routed or not is running into
> > > > "undefined" territory. If you rmmod the vhost driver, it would also
> > > > talk to the hypervisor provided vsock.
> > >
> > > Oh, I missed that.
> > > And I also fixed that behaviour with commit 65b422d9b61b ("vsock:
> > > forward all packets to the host when no H2G is registered") after I
> > > implemented the multi-transport support.
> > >
> > > mmm, this could change my position ;-) (although, to be honest, I
> > > don't understand why it was like that in the first place, but that's
> > > how it is now).
> > >
> > > Please document this in the new commit message too; it's a good
> > > point. Although when H2G is loaded, we behave differently. However,
> > > it is true that a sysctl helps us standardize this behavior.
> > >
> > > I don't know whether to see it as a regression or not.
> > >
> > > > > Should we make this feature opt-in in some way, such as a
> > > > > sockopt or sysctl? (I understand that there is the previous
> > > > > problem, but honestly, it seems like a significant change to the
> > > > > behavior of AF_VSOCK).
> > > >
> > > > We can create a sysctl to enable the behavior with default=on. But
> > > > I'm against making the cumbersome does-not-work-out-of-the-box
> > > > experience the default. Will include it in v2.
> > >
> > > The opposite point of view is that we would not want to have
> > > different default behavior between 7.0 and 7.1 when H2G is loaded.
> >
> > From a VMCI perspective, we only allow communication from the guest
> > to host CIDs 0 and 2. With has_remote_cid implemented for VMCI, we
> > end up attempting guest-to-guest communication. As mentioned, this
> > does already happen if there isn't an H2G transport registered, so we
> > should be handling this anyway. But I'm not too fond of the change in
> > behaviour for when H2G is present, so at the very least I'd prefer
> > that has_remote_cid not be implemented for VMCI. Or perhaps there
> > could be a way for a G2H transport to explicitly note that it
> > supports CIDs greater than 2? With this, it would be easier to see
> > this patch as preserving the default behaviour for some transports
> > and fixing a bug for others.
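To make the suggestion above concrete, the per-transport capability bit could be modeled roughly like this. This is a toy userspace sketch, not kernel code: `Transport`, `supports_high_cids`, and `pick_transport` are invented names for illustration; the real selection logic lives in `vsock_assign_transport()` in net/vmw_vsock/af_vsock.c.

```python
# Toy model of vsock transport selection with a per-transport
# "supports remote CIDs > 2" capability bit. All names here are
# invented; this is not the kernel implementation.

VMADDR_CID_HOST = 2

class Transport:
    def __init__(self, name, supports_high_cids=False, guest_cids=()):
        self.name = name
        # G2H capability: can this transport reach remote CIDs > 2?
        # (virtio in a Nitro instance: yes; VMCI: no)
        self.supports_high_cids = supports_high_cids
        # H2G: set of child-VM CIDs this transport has registered.
        self.guest_cids = set(guest_cids)

def pick_transport(remote_cid, g2h, h2g):
    """Return the transport a connect() to remote_cid would use."""
    # The H2G (vhost) side owns any CID it explicitly registered.
    if h2g and remote_cid in h2g.guest_cids:
        return h2g
    # CIDs <= 2 always go to the host.
    if remote_cid <= VMADDR_CID_HOST:
        return g2h
    # CID > 2 and not registered by vhost: only overlay onto G2H if
    # the transport declares support for high remote CIDs.
    if g2h and g2h.supports_high_cids:
        return g2h
    # Otherwise keep the documented default: route toward the guest
    # side (the connect fails there if no such VM exists).
    return h2g

virtio = Transport("virtio", supports_high_cids=True)
vmci = Transport("vmci", supports_high_cids=False)
vhost = Transport("vhost", guest_cids={5})

print(pick_transport(5, virtio, vhost).name)   # vhost-registered CID wins
print(pick_transport(42, virtio, vhost).name)  # overlay falls through to virtio
print(pick_transport(42, vmci, vhost).name)    # VMCI keeps the old behaviour
```

With this shape, VMCI simply leaves the capability unset and sees no behavioural change, while virtio opts in to sibling routing.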
>
> I understand what you want, but beware that it's actually a change in
> behavior. Today, whether Linux will send vsock connects to VMCI depends
> on whether the vhost kernel module is loaded: If it's loaded, you don't
> see the connect attempt. If it's not loaded, the connect will come
> through to VMCI.
>
> I agree that it makes sense to limit VMCI to only ever see connects to
> CIDs <= 2 consistently. But as I said above, it's actually a change in
> behavior.
>
> Alex
I think it was unintentional, but if you really think people want a
special module that changes the kernel's behaviour on load, we can
certainly do that. But any hack like this will not be namespace safe.

>
> Amazon Web Services Development Center Germany GmbH
> Tamara-Danz-Str. 13
> 10243 Berlin
> Geschaeftsfuehrung: Christof Hellmis, Andreas Stieger
> Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
> Sitz: Berlin
> Ust-ID: DE 365 538 597
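For readers joining the thread, the current (pre-patch) routing discussed above, including the VMADDR_FLAG_TO_HOST override and the no-H2G fallback from commit 65b422d9b61b, can be summarized in a small model. Only `VMADDR_FLAG_TO_HOST` (0x01 in <linux/vm_sockets.h>) and `VMADDR_CID_HOST` are real identifiers; `route()` and its return values are a simplified sketch, not the actual kernel implementation.

```python
# Toy model of today's vsock routing, before the patch under discussion.
VMADDR_CID_HOST = 2
VMADDR_FLAG_TO_HOST = 0x01  # real flag from <linux/vm_sockets.h>

def route(remote_cid, flags, h2g_loaded):
    """Return 'g2h' (toward the host/sibling) or 'h2g' (toward a nested guest)."""
    if flags & VMADDR_FLAG_TO_HOST:
        return "g2h"   # explicit flag: always route toward the host
    if remote_cid <= VMADDR_CID_HOST:
        return "g2h"   # documented: CID <= 2 talks to the host
    if not h2g_loaded:
        return "g2h"   # commit 65b422d9b61b: no H2G transport -> forward to host
    return "h2g"       # documented default: CID >= 3 talks to the guest

print(route(42, 0, h2g_loaded=True))                    # nested-guest routing
print(route(42, VMADDR_FLAG_TO_HOST, h2g_loaded=True))  # flag forces host side
print(route(42, 0, h2g_loaded=False))                   # fallback when no H2G
```

The patch effectively inserts a new branch between the last two: for CID > 2 not registered by vhost, the G2H transport may win even when H2G is loaded, which is exactly the behaviour change debated above.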

