In this patch, we update the design document to reflect the netlink-based kernel-userspace interface implementation and a few other changes. I have covered the changes at a high level.
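To give a flavor of the new interface, here is a minimal, hypothetical sketch of how userspace talks to the datapath through the emulated netlink sockets. It reuses the existing nl_* helpers (lib/netlink-socket.h, lib/netlink.h, lib/ofpbuf.h), assumes the generic netlink family ID for the datapath has already been resolved by the caller, and omits all error handling; it is illustrative only, not a definitive code path:

    #include "netlink-socket.h"     /* nl_sock_create(), nl_sock_transact() */
    #include "netlink.h"            /* nl_msg_put_genlmsghdr() */
    #include "ofpbuf.h"
    #include <linux/openvswitch.h>  /* OVS_DP_CMD_GET, OVS_DATAPATH_VERSION */

    /* Sketch: query the datapath over a netlink socket.  On Windows,
     * nl_sock_create() opens a handle to the OVS pseudo device instead
     * of creating a kernel socket object. */
    static void
    sketch_dp_get(int ovs_datapath_family)  /* family ID, pre-resolved */
    {
        struct nl_sock *sock;
        struct ofpbuf request, *reply;

        nl_sock_create(NETLINK_GENERIC, &sock);

        ofpbuf_init(&request, 0);
        nl_msg_put_genlmsghdr(&request, 0, ovs_datapath_family,
                              NLM_F_REQUEST, OVS_DP_CMD_GET,
                              OVS_DATAPATH_VERSION);
        /* ... append a 'struct ovs_header' and nlattrs here ... */

        /* One write plus one read on the pseudo device underneath. */
        nl_sock_transact(sock, &request, &reply);

        ofpbuf_uninit(&request);
        ofpbuf_delete(reply);
        nl_sock_destroy(sock);
    }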
Please feel free to extend the document with any details that you think were missed. Signed-off-by: Nithin Raju <nit...@vmware.com> --- datapath-windows/DESIGN | 260 +++++++++++++++++++++++++++++----------------- 1 files changed, 164 insertions(+), 96 deletions(-) diff --git a/datapath-windows/DESIGN b/datapath-windows/DESIGN index b438c44..638990d 100644 --- a/datapath-windows/DESIGN +++ b/datapath-windows/DESIGN @@ -1,20 +1,13 @@ OVS-on-Hyper-V Design Document ============================== -There has been an effort in the recent past to develop the Open vSwitch (OVS) -solution onto multiple hypervisor platforms such as FreeBSD and Microsoft -Hyper-V. VMware has been working on a OVS solution for Microsoft Hyper-V for -the past few months and has successfully completed the implementation. - -This document provides details of the development effort. We believe this -document should give enough information to members of the community who are -curious about the developments of OVS on Hyper-V. The community should also be -able to get enough information to make plans to leverage the deliverables of -this effort. - -The userspace portion of the OVS has already been ported to Hyper-V and -committed to the openvswitch repo. So, this document will mostly emphasize on -the kernel driver, though we touch upon some of the aspects of userspace as -well. +There has been a community effort to develop Open vSwitch on Microsoft Hyper-V. +In this document, we provide details of the development effort. We believe this +document should give enough information to understand the overall design. + +The userspace portion of OVS has been ported to Hyper-V in a separate +effort, and committed to the openvswitch repo. So, this document will mostly +focus on the kernel driver, though we touch upon some aspects of +userspace as well. We cover the following topics: 1. Background into relevant Hyper-V architecture @@ -48,13 +41,13 @@ In Hyper-V, the virtual machine is called the Child Partition. Each VIF or physical NIC on the Hyper-V extensible switch is attached via a port. Each port is both on the ingress path or the egress path of the switch. The ingress path is used for packets being sent out of a port, and egress is used for packet -being received on a port. By design, NDIS provides a layered interface, where -in the ingress path, higher level layers call into lower level layers, and on -the egress path, it is the other way round. In addition, there is a object -identifier (OID) interface for control operations Eg. addition of a port. The -workflow for the calls is similar in nature to the packets, where higher level -layers call into the lower level layers. A good representational diagram of -this architecture is in [4]. +being received on a port. By design, NDIS provides a layered interface. In this +layered interface, higher level layers call into lower level layers in the +ingress path; in the egress path, it is the other way round. In addition, there +is an object identifier (OID) interface for control operations, e.g., the +addition of a port. The workflow for these calls is similar in nature to that +of packets, where higher level layers call into the lower level layers. A good +representational diagram of this architecture is in [4]. Windows Filtering Platform (WFP)[5] is a platform implemented on Hyper-V that provides APIs and services for filtering packets. WFP has been utilized to @@ -75,22 +68,23 @@ has been used to retrieve some of the configuration information that OVS needs.
| | +------+ +--------------+ | +-----------+ +------------+ | | | | | | | | | | | - | OVS- | | OVS | | | Virtual | | Virtual | | - | wind | | USERSPACE | | | Machine #1| | Machine #2 | | - | | | DAEMON/CTL | | | | | | | + | ovs- | | OVS- | | | Virtual | | Virtual | | + | *ctl | | USERSPACE | | | Machine #1| | Machine #2 | | + | | | DAEMON | | | | | | | +------+-++---+---------+ | +--+------+-+ +----+------++ | +--------+ - | DPIF- | | netdev- | | |VIF #1| |VIF #2| | |Physical| - | Windows |<=>| Windows | | +------+ +------+ | | NIC | + | dpif- | | netdev- | | |VIF #1| |VIF #2| | |Physical| + | netlink | | windows | | +------+ +------+ | | NIC | +---------+ +---------+ | || /\ | +--------+ -User /\ | || *#1* *#4* || | /\ -=========||=======================+------||-------------------||--+ || -Kernel || \/ || ||=====/ - \/ +-----+ +-----+ *#5* +User /\ /\ | || *#1* *#4* || | /\ +=========||=========||============+------||-------------------||--+ || +Kernel || || \/ || ||=====/ + \/ \/ +-----+ +-----+ *#5* +-------------------------------+ | | | | | +----------------------+ | | | | | | | OVS Pseudo Device | | | | | | - | +----------------+-----+ | | | | | - | | | I | | | + | +----------------------+ | | | | | + | | Netlink Impl. | | | | | | + | ----------------- | | I | | | | +------------+ | | N | | E | | | Flowtable | +------------+ | | G | | G | | +------------+ | Packet | |*#2*| R | | R | @@ -110,9 +104,8 @@ Kernel || \/ || ||=====/ Figure 2 shows the various blocks involved in the OVS Windows implementation, along with some of the components available in the NDIS stack, and also the virtual machines. The workflow of a packet being transmitted from a VIF out and -into another VIF and to a physical NIC is also shown. New userspace components -being added as also shown. Later on in this section, we’ll discuss the flow of -a packet at a high level. +into another VIF and to a physical NIC is also shown. Later on in this section, +we will discuss the flow of a packet at a high level. The figure gives a general idea of where the OVS userspace and the kernel components fit in, and how they interface with each other. @@ -122,9 +115,11 @@ a forwarding extension roughly implementing the following sub-modules/functionality. Details of each of these sub-components in the kernel are contained in later sections: * Interfacing with the NDIS stack + * Netlink message parser + * Netlink sockets * Switch/Datapath management * Interfacing with userspace portion of the OVS solution to implement the - necessary ioctls that userspace needs + necessary functionality that userspace needs * Port management * Flowtable/Actions/packet forwarding * Tunneling @@ -140,32 +135,36 @@ are: * Interface between the userspace and the kernel module. * Event notifications are significantly different. * The communication interface between DPIF and the kernel module need not be - implemented in the way OVS on Linux does. + implemented in the way OVS on Linux does. That said, it would be + advantageous to have a similar interface to the kernel module for reasons of + readability and maintainability. * Any licensing issues of using Linux kernel code directly. Due to these differences, it was a straightforward decision to develop the datapath for OVS on Hyper-V from scratch rather than porting the one on Linux. 
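+Although the datapath itself was developed from scratch, the interface it
+exposes to userspace is netlink, just as on Linux. As a minimal sketch (the
+authoritative definitions live in the interface headers discussed in section
+2.c together with the standard netlink/genetlink headers; the attribute
+layout varies per command, and the wrapper name 'ovs_nl_request' below is
+purely illustrative), a request crossing the kernel-userspace boundary is
+framed as a standard generic netlink message:
+
+    /* Conceptual framing of an OVS netlink request (netlink padding and
+     * alignment rules omitted for brevity). */
+    struct ovs_nl_request {
+        struct nlmsghdr   nlh;   /* total length, family, flags, seq, PID */
+        struct genlmsghdr genl;  /* command, e.g. OVS_DP_CMD_GET, version */
+        struct ovs_header ovsh;  /* dp_ifindex: identifies the datapath */
+        /* followed by 'struct nlattr' type-length-value attributes */
+    };
+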
-A re-development focussed on the following goals: +A re-development focused on the following goals: * Adhere to the existing requirements of userspace portion of OVS (such as - ovs- vswitchd), to minimize changes in the userspace workflow. + ovs-vswitchd), to minimize changes in the userspace workflow. * Fit well into the typical workflow of a Hyper-V extensible switch forwarding extension. The userspace portion of the OVS solution is mostly POSIX code, and not very -Linux specific. Majority of the code has already been ported and committed to -the openvswitch repo. Most of the daemons such as ovs-vswitchd or ovsdb-server -can run on Windows now. One additional daemon that has been implemented is -called ovs-wind. At a high level ovs-wind manages keeps the ovsdb used by -userspace in sync with the kernel state. More details in the userspace section. +Linux specific. The majority of the userspace code does not interface directly +with the kernel datapath and was ported independently of the kernel datapath +effort. As explained in the OVS porting design document [7], DPIF is the portion of -userspace that interfaces with the kernel portion of the OVS. Each platform can -have its own implementation of the DPIF provider whose interface is defined in -dpif-provider.h [3]. For OVS on Hyper-V, we have an implementation of DPIF -provider for Hyper-V. The communication interface between userspace and the -kernel is a pseudo device and is different from that of the Linux’s DPIF -provider which uses netlink. But, as long as the DPIF provider interface is the -same, the callers should be agnostic of the underlying communication interface. +userspace that interfaces with the kernel portion of the OVS. The interface +that each DPIF provider has to implement is defined in dpif-provider.h [3]. +Though each platform is allowed to have its own implementation of the DPIF +provider, it was found, via community feedback, that it is desirable to +share code whenever possible. Thus, the DPIF provider for OVS on Hyper-V shares +code with the DPIF provider on Linux. This shared provider is implemented in +dpif-netlink.c, formerly dpif-linux.c. + +We'll elaborate more on the kernel-userspace interface in a dedicated section +below. Here it suffices to say that the DPIF provider implementation for +Windows is netlink-based and shares code with the Linux one. 2.a) Kernel module (datapath) ----------------------------- @@ -178,8 +177,8 @@ This is consistent with using a single datapath in the kernel on Linux. All the physical adapters are connected as external adapters to the extensible switch. When the OVS switch extension registers itself as a filter driver, it also -registers callbacks for the switch management and datapath functions. In other -words, when a switch is created on the Hyper-V root partition (host), the +registers callbacks for the switch/port management and datapath functions. In +other words, when a switch is created on the Hyper-V root partition (host), the extension gets an activate callback upon which it can initialize the data structures necessary for OVS to function. Similarly, there are callbacks for when a port gets added to the Hyper-V switch, and an External Network adapter @@ -190,7 +189,7 @@ packet is received on an external NIC. As shown in the figures, an extensible switch extension gets to see a packet sent by the VM (VIF) twice - once on the ingress path and once on the egress path. Forwarding decisions are to be made on the ingress path.
Correspondingly, -we’ll be hooking onto the following interfaces: +we will be hooking onto the following interfaces: * Ingress send indication: intercept packets for performing flow based forwarding.This includes straight forwarding to output ports. Any packet modifications needed to be performed are done here either inline or by @@ -203,11 +202,41 @@ we will be hooking onto the following interfaces: Interfacing with OVS userspace ------------------------------ -We’ve implemented a pseudo device interface for letting OVS userspace talk to +We have implemented a pseudo device interface for letting OVS userspace talk to the OVS kernel module. This is equivalent to the typical character device -interface on POSIX platforms. The pseudo device supports a whole bunch of +interface on POSIX platforms where we can register custom functions for read, +write and ioctl functionality. The pseudo device supports a number of ioctls that netdev and DPIF on OVS userspace make use of. +Netlink message parser +---------------------- +The communication between OVS userspace and the OVS kernel datapath is in the +form of Netlink messages [8]. More details about this are provided in section +2.c, kernel-userspace interface. In the kernel, a full-fledged netlink message +parser has been implemented along the lines of the netlink message parser in +OVS userspace. In fact, a lot of the code is ported code. + +Along the lines of 'struct ofpbuf' in OVS userspace, a managed buffer has been +implemented in the kernel datapath to make it easier to parse and construct +netlink messages. + +Netlink sockets +--------------- +On Linux, OVS userspace utilizes netlink sockets to pass netlink messages back +and forth. Since much of the userspace code, including the DPIF provider in +dpif-netlink.c (formerly dpif-linux.c), has been reused, pseudo-netlink sockets +have been implemented in OVS userspace. Windows lacks native netlink socket +support, and the socket family is not extensible either. +Hence it is not possible to provide a native implementation of netlink sockets. +We emulate netlink sockets in lib/netlink-socket.c and support all of the nl_* +APIs to higher levels. The implementation opens a handle to the pseudo device +for each netlink socket. Some more details on this topic are provided in the +userspace section on netlink sockets. + +Typical netlink semantics of read message, write message, dump, and transaction +have been implemented so that higher level layers are not affected by the +netlink implementation not being native. + Switch/Datapath management -------------------------- As explained above, we hook onto the management callback functions in the NDIS @@ -267,48 +296,83 @@ used. 2.b) Userspace components ------------------------- -A new daemon has been added to userspace to manage the entities in OVSDB, and -also to keep it in sync with the kernel state, and this include bridges, -physical NICs, VIFs etc. For example, upon bootup, ovs-wind does a get on the -kernel to get a list of the bridges, and the corresponding ports and populates -OVSDB. If a new VIF gets added to the kernel switch because a user powered on a -Virtual Machine, ovs-wind detects it, and adds a corresponding entry in the -ovsdb. This implies that ovs-wind has a synchronous as well as an asynchronous -interface to the OVS kernel driver. - +The userspace portion of the OVS solution is mostly POSIX code, and not very
Majority of the userspace code does not interface directly with +the kernel datapath and was ported independently of the kernel datapath +effort. + +In this section, we cover the userspace components that interface with the +kernel datapath. + +As explained earlier, OVS on Hyper-V shares the DPIF provider implementation +with Linux. The DPIF provider on Linux uses netlink sockets and netlink +messages. Netlink sockets and messages are extensively used on Linux to +exchange information between userspace and kernel. In order to satisfy these +dependencies, netlink socket (pseudo and non-native) and netlink messages +are implemented on Hyper-V. + +The following are the major advantages of sharing DPIF provider code: +1. Maintenance is simpler: + Any change made to the interface defined in dpif-provider.h need not be + propagated to multiple implementations. Also, developers familiar with the + Linux implementation of the DPIF provider can easily ramp on the Hyper-V + implementation as well. +2. Netlink messages provides inherent advantages: + Netlink messages are known for their extensibility. Each message is + versioned, so the provided data structures offer a mechanism to perform + version checking and forward/backward compatibility with the kernel + module. + +Netlink sockets +--------------- +As explained in other sections, an emulation of netlink sockets has been +implemented in lib/netlink-socket.c for Windows. The implementation creates a +handle to the OVS pseudo device, and emulates netlink socket semantics of +receive message, send message, dump, and transact. Most of the nl_* functions +are supported. + +The fact that the implementation is non-native manifests in various ways. +One example is that PID for the netlink socket is not automatically assigned in +userspace when a handle is created to the OVS pseudo device. There's an extra +command (defined in OvsDpInterfaceExt.h) that is used to grab the PID generated +in the kernel. + +DPIF provider +-------------- +As has been mentioned in earlier sections, the netlink socket and netlink +message based DPIF provider on Linux has been ported to Windows. +Correspondingly, the file is called lib/dpif-netlink.c now from its former +name of lib/dpif-linux.c. -2.c) Kernel-Userspace interface -------------------------------- -DPIF-Windows ------------- -DPIF-Windows is the Windows implementation of the interface defined in dpif- -provider.h, and provides an interface into the OVS kernel driver. We implement -most of the callbacks required by the DPIF provider. A quick summary of the -functionality implemented is as follows: - * dp_dump, dp_get: dump all datapath information or get information for a - particular datapath. Currently we only support one datapath. - * flow_dump, flow_put, flow_get, flow_flush: These functions retrieve all - flows in the kernel, add a flow to the kernel, get a specific flow and - delete all the flows in the kernel. - * recv_set, recv, recv_wait, recv_purge: these poll packets for upcalls. - * execute: This is used to send packets from userspace to the kernel. The - packets could be either flow miss packet punted from kernel earlier or - userspace generated packets. - * vport_dump, vport_get, ext_info: These functions dump all ports in the - kernel, get a specific port in the kernel, or get extended information - about a port. - * event_subscribe, wait, poll: These functions subscribe, wait and poll the - events that kernel posts. 
A typical example is kernel notices a port has - gone up/down, and would like to notify the userspace. +Most of the code is common. Some divergence exists in the code to receive +packets: the Linux implementation uses epoll() [9], which is not natively +supported on Windows. Netdev-Windows -------------- -We have a Windows implementation of the the interface defined in lib/netdev- -provider.h. The implementation provided functionality to get extended -information about an interface. It is limited in functionality compared to the -Linux implementation of the netdev provider and cannot be used to add any -interfaces in the kernel such as a tap interface. +We have a Windows implementation of the interface defined in +lib/netdev-provider.h. The implementation provides functionality to get +extended information about an interface. It is limited in functionality +compared to the Linux implementation of the netdev provider and cannot be used +to add any interfaces in the kernel such as a tap interface or to send/receive +packets. The netdev-windows implementation uses the datapath interface +extensions defined in: +datapath-windows/include/OvsDpInterfaceExt.h +2.c) Kernel-Userspace interface +------------------------------- +openvswitch.h and OvsDpInterfaceExt.h +------------------------------------- +Since the DPIF provider is shared with Linux, the kernel datapath provides the +same interface as the Linux datapath. The interface is defined in +datapath/linux/compat/include/linux/openvswitch.h. Derivatives of this +interface file are created during OVS userspace compilation. The derivative for +the kernel datapath on Hyper-V is provided in the following location: +datapath-windows/include/OvsDpInterface.h + +That said, there are Windows-specific extensions that are defined in the +interface file: +datapath-windows/include/OvsDpInterfaceExt.h 2.d) Flow of a packet --------------------- @@ -354,9 +418,9 @@ driver. Reference list: =============== -1: Hyper-V Extensible Switch +1. Hyper-V Extensible Switch http://msdn.microsoft.com/en-us/library/windows/hardware/hh598161(v=vs.85).aspx -2: Hyper-V Extensible Switch Extensions +2. Hyper-V Extensible Switch Extensions http://msdn.microsoft.com/en-us/library/windows/hardware/hh598169(v=vs.85).aspx 3. DPIF Provider http://openvswitch.sourcearchive.com/documentation/1.1.0-1/dpif- @@ -369,3 +433,7 @@ http://msdn.microsoft.com/en-us/library/windows/desktop/aa366510(v=vs.85).aspx 7. How to Port Open vSwitch to New Software or Hardware http://git.openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING +8. Netlink +http://en.wikipedia.org/wiki/Netlink +9. epoll +http://en.wikipedia.org/wiki/Epoll -- 1.7.4.1