On Mon, Aug 19, 2019 at 02:09:11PM +0100, Stefan Hajnoczi wrote:
> On Thu, Jun 06, 2019 at 12:09:12PM +0200, Stefano Garzarella wrote:
> >
> > Hi all,
> > this is a v2 of a proposal addressing the comments made by Dexuan, Stefan,
> > and Jorgen.
> >
> > v1: https://www.spinics.net/lists/netdev/msg570274.html
> >
> >
> > We can define two types of transport that we have to handle at the same
> > time (e.g. in a nested VM we would have both types of transport running
> > together):
> >
> > - 'host->guest' transport: it runs in the host and is used to communicate
> >   with the guests of a specific hypervisor (KVM, VMware, or Hyper-V). It
> >   also runs in a guest that has nested guests, to communicate with them.
> >
> >   [Phase 2]
> >   We can support multiple 'host->guest' transports running at the same
> >   time, but on x86 only one hypervisor uses VMX at any given time.
> >
> > - 'guest->host' transport: it runs in the guest and is used to communicate
> >   with the host.
> >
> >
> > The main goal is to find a way to decide which transport to use in these
> > cases:
> >
> > 1. connect() / sendto()
> >
> >    a. use the 'host->guest' transport if the destination is a guest
> >       (dest_cid > VMADDR_CID_HOST).
> >
> >       [Phase 2]
> >       In order to support multiple 'host->guest' transports running at
> >       the same time, we should assign CIDs uniquely across all
> >       transports. In this way, a packet generated by the host side will
> >       get directed to the appropriate transport based on the CID.
> >
> >    b. use the 'guest->host' transport if the destination is the host or
> >       the hypervisor
> >       (dest_cid == VMADDR_CID_HOST || dest_cid == VMADDR_CID_HYPERVISOR).
> >
> >
> > 2. listen() / recvfrom()
> >
> >    a. use the 'host->guest' transport if the socket is bound to
> >       VMADDR_CID_HOST, or if it is bound to VMADDR_CID_ANY and there is
> >       no 'guest->host' transport.
> >       We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order
> >       to address this case.
> >
> >       [Phase 2]
> >       We can support network namespaces to create independent AF_VSOCK
> >       addressing domains:
> >       - they could be used to partition VMs between hypervisors or at a
> >         finer granularity;
> >       - they could be used to isolate host applications from guest
> >         applications using the same ports with VMADDR_CID_ANY.
> >
> >    b. use the 'guest->host' transport if the socket is bound to a local
> >       CID different from VMADDR_CID_HOST (the guest CID obtained with
> >       IOCTL_VM_SOCKETS_GET_LOCAL_CID), or if it is bound to
> >       VMADDR_CID_ANY (to be backward compatible).
> >       Also in this case, we could define a new VMADDR_CID_LISTEN_FROM_HOST.
> >
> >    c. shared port space between transports
> >       For incoming requests or packets, we should be able to choose
> >       which transport to use by looking at the requested 'port'.
> >
> >       - stream sockets already support a shared port space between
> >         transports (one port can be assigned to only one transport)
> >
> >       [Phase 2]
> >       - datagram sockets will support it, but for now the VMCI transport
> >         is the default transport for any host-side datagram socket (KVM
> >         and Hyper-V do not yet support datagram sockets)
> >
> > We will make the loading of af_vsock.ko independent of the transports,
> > to allow us to:
> > - create an AF_VSOCK socket without any loaded transports;
> > - listen on a socket (e.g. bound to VMADDR_CID_ANY) without any loaded
> >   transports.
> >
> > Hopefully, we could move MODULE_ALIAS_NETPROTO(PF_VSOCK) from
> > vmci_transport.ko to af_vsock.ko.
> > [Jorgen will check if this will impact the existing VMware products]
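
To make rules 1 and 2 above a bit more concrete, here is a rough sketch in
plain C of the CID-based selection I have in mind (the vsock_transport
struct and the two transport pointers are hypothetical placeholders, not
existing af_vsock code; the CID values match <linux/vm_sockets.h>):

/* Sketch only: not actual af_vsock code. */
#define VMADDR_CID_HOST         2U
#define VMADDR_CID_ANY          (-1U)

struct vsock_transport;                         /* opaque here */

static struct vsock_transport *g2h_transport;   /* 'guest->host', if loaded */
static struct vsock_transport *h2g_transport;   /* 'host->guest', if loaded */

/* Rule 1: pick the transport for connect()/sendto() by destination CID. */
static struct vsock_transport *vsock_pick_transport(unsigned int dest_cid)
{
        if (dest_cid > VMADDR_CID_HOST) {
                /*
                 * Rule 1.a: the destination is a guest.  If no
                 * 'host->guest' transport is loaded, this is NULL and the
                 * caller drops the packet (guest->guest is not allowed).
                 */
                return h2g_transport;
        }

        /* Rule 1.b: VMADDR_CID_HOST or VMADDR_CID_HYPERVISOR. */
        return g2h_transport;
}

/* Rule 2.a: can a listener bound to this CID receive from guests? */
static int vsock_listen_from_guests(unsigned int bound_cid)
{
        return bound_cid == VMADDR_CID_HOST ||
               (bound_cid == VMADDR_CID_ANY && !g2h_transport);
}

With VMADDR_CID_ANY both listen checks can be true at the same time, which
is where the shared port space in rule 2.c comes into play.
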
> >
> > Notes:
> > - For Hyper-V sockets, the host can only be Windows. No changes should
> >   be required on the Windows host to support the changes in this
> >   proposal.
> >
> > - Communication between guests is not allowed on any transport, so we
> >   can drop packets sent from one guest to another guest
> >   (dest_cid > VMADDR_CID_HOST) if the 'host->guest' transport is not
> >   available.
> >
> > - The [Phase 2] tag is used to identify things that can be done at a
> >   later stage, but that should be taken into account during this design.
> >
> > - Namespace support will be developed in [Phase 2] or in a separate
> >   project.
> >
> >
> > Comments and suggestions are welcome.
> > I'll be on PTO for the next two weeks, so sorry in advance if I answer
> > late.
> >
> > If we agree on this proposal, when I get back I'll start working on the
> > code to get a first RFC PATCH.
>
> Stefano,
> I've reviewed your proposal and it looks good for solving nested
> virtualization.
Hi Stefan,

Thank you very much for the review!

> The tricky implementation details will be supporting listen sockets,
> especially with VMADDR_CID_ANY so they can be accessed from both
> transports.

Yes, it will be tricky because the current implementation has a 1-to-1
mapping with the transport callbacks. Maybe I could move some logic into
the core (e.g. for listening sockets) to have a single point of control
(e.g. using vsk->pending_links in all transports).

I'll work on it in the next few weeks, and I'll keep you updated.

Thanks,
Stefano
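
P.S. just to illustrate the 'single point of control' idea, here is a toy
sketch in plain C (all names are hypothetical, locking and socket wakeups
are omitted, nothing here is existing af_vsock code): every transport would
funnel incoming connection requests into one core-owned queue on the
listening socket, and accept() would only ever look at that queue,
regardless of which transport the request came from.

#include <stdlib.h>

struct pending_conn {
        unsigned int peer_cid;
        unsigned int peer_port;
        struct pending_conn *next;
};

struct listener {
        struct pending_conn *pending_head;      /* core-owned queue */
        struct pending_conn **pending_tail;
};

static void listener_init(struct listener *l)
{
        l->pending_head = NULL;
        l->pending_tail = &l->pending_head;
}

/* Called by any transport ('guest->host' or 'host->guest') on a request. */
static int core_enqueue_pending(struct listener *l, unsigned int cid,
                                unsigned int port)
{
        struct pending_conn *p = malloc(sizeof(*p));

        if (!p)
                return -1;
        p->peer_cid = cid;
        p->peer_port = port;
        p->next = NULL;
        *l->pending_tail = p;
        l->pending_tail = &p->next;
        return 0;
}

/* Called from accept(): one code path, whatever the source transport. */
static struct pending_conn *core_dequeue_pending(struct listener *l)
{
        struct pending_conn *p = l->pending_head;

        if (p) {
                l->pending_head = p->next;
                if (!l->pending_head)
                        l->pending_tail = &l->pending_head;
        }
        return p;
}

In the real code, vsk->pending_links would be the natural place to hold
such a queue, as mentioned above.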