Open Virtual Network (OVN) Proposed Architecture
================================================
The Open vSwitch team is pleased to announce OVN, a new subproject in
development within the Open vSwitch project. The full project announcement
is on the Network Heresy blog and is reproduced at:

    http://openvswitch.org/pipermail/dev/2015-January/050379.html

OVN complements the existing capabilities of OVS, adding native support for
virtual network abstractions such as virtual L2 and L3 overlays and security
groups. Just like OVS, our design goal is a production-quality
implementation that can operate at significant scale. This post outlines the
proposed high-level architecture for OVN.

This document mainly discusses the design of OVN on hypervisors (including
container systems). The complete OVN system will also include support for
software and hardware gateways for logical-physical integration, and perhaps
also for service nodes that offload multicast replication from hypervisors.
Each of these classes of devices has much in common, so our discussion here
refers to them collectively as "chassis" or "transport nodes".

Layering
========

From lowest to highest level, OVN comprises the following layers.

Open vSwitch
------------

The lowest layer is Open vSwitch, that is, ovs-vswitchd and ovsdb-server.
OVN will use standard Open vSwitch, not a specially patched or modified
version. OVN will use some of the Open vSwitch extensions to OpenFlow, since
many of those extensions were introduced to solve problems in network
virtualization. For that reason, OVN will probably not work with OpenFlow
implementations other than Open vSwitch.

On hypervisors, we expect OVN to use the hypervisor integration features
described in IntegrationGuide.md in the OVS repository. These features allow
controllers, such as ovn-controller, to associate vifs instantiated on a
given hypervisor with configured VMs and their virtual interfaces. The same
interfaces allow for container integration.

ovn-controller
--------------

The layer just above Open vSwitch proper consists of ovn-controller, an
additional daemon that runs on every chassis (hardware VTEPs are a special
case; they might use something different). Southbound, it talks to
ovs-vswitchd over the OpenFlow protocol (with extensions) and to
ovsdb-server over the OVSDB protocol.

                          OVN Database
                                |
                                | (OVSDB Protocol)
                                |
+---------------------------------------------------------------+
|                               |                               |
|                        ovn-controller                         |
|                        |            |                         |
|                        |            |                         |
|            (OVSDB Protocol)     (OpenFlow)                    |
|                        |            |                         |
|                  ovsdb-server  ovs-vswitchd                   |
|                                                               |
+-------------------------- Hypervisor -------------------------+

ovn-controller does not interact directly with the Open vSwitch kernel
module (or DPDK or any other datapath). Instead, it uses the same public
OpenFlow and OVSDB interfaces used by any other controller. This avoids
entangling OVN and OVS. Each version of ovn-controller will require some
minimum version of Open vSwitch. It may be necessary to pair matching
versions of ovn-controller and OVS (which is likely feasible, since they run
on the same physical machine), but it is probably possible, and better, to
tolerate some version skew.

Northbound, ovn-controller talks to the OVN database (described in the
following section) using a database protocol.
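To make the hypervisor integration features mentioned under "Open vSwitch"
above more concrete, here is a minimal sketch of how a hypervisor
integration might register a vif in the local Open vSwitch database so that
ovn-controller can associate it with its VM. The external-ids keys follow
IntegrationGuide.md; the bridge name, port name, and values are only
examples:

    ovs-vsctl add-port br-int tap0 -- set Interface tap0 \
        external-ids:iface-id=<vif-uuid-from-the-CMS> \
        external-ids:attached-mac=00:00:00:00:00:01 \
        external-ids:vm-id=<vm-uuid-from-the-CMS>

ovn-controller can monitor such Interface records through the local
ovsdb-server to learn which vifs are currently instantiated on the chassis;
that information feeds the Bindings data described below.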
ovn-controller has the following tasks:

* Translate the configuration and state obtained from the OVN database into
  OpenFlow flows and other state, and push that state down into Open
  vSwitch over the OpenFlow and OVSDB protocols. This occurs in response to
  network configuration updates, not in reaction to data packets arriving
  from virtual or physical interfaces. Examples include translation of
  logical datapath flows into OVS flows (see Logical Datapath Flows below),
  pushed via OpenFlow to ovs-vswitchd, and instantiation of tunnels,
  configured via OVSDB through the chassis's ovsdb-server.

* Populate the bindings component of the OVN database (described later)
  with chassis state relevant to OVN. On a hypervisor, this includes the
  vifs instantiated on the hypervisor at any given time, so that other
  chassis can correctly forward packets destined to VMs on this hypervisor.
  On a gateway (software or hardware), this includes MAC learning state
  from physical ports.

* In corner cases, respond to packets arriving from virtual interfaces (via
  OpenFlow). For example, ARP suppression may require observing packets
  from VMs through OpenFlow "packet-in" messages.

ovn-controller is not a centralized controller but what we refer to as a
"local controller", since an independent instance runs locally on every
hypervisor. It is also not a general-purpose "SDN controller": it performs
only the specific tasks outlined above in support of OVN's virtual
networking functionality.

OVN database
------------

The OVN database contains three classes of data with different properties:

* Physical Network (PN): information about the chassis nodes in the system.
  This contains all the information necessary to wire the overlay, such as
  IP addresses, supported tunnel types, and security keys.

  The amount of PN data is small (O(n) in the number of chassis) and it
  changes infrequently, so it can be replicated to every chassis.

* Logical Network (LN): the topology of logical switches and routers, ACLs,
  firewall rules, and everything else needed to describe how packets
  traverse a logical network, represented as logical datapath flows (see
  Logical Datapath Flows, below).

  LN data may be large (O(n) in the number of logical ports, ACL rules,
  etc.). Thus, to improve scaling, each chassis should receive only the
  data related to the logical networks in which that chassis participates.
  Past experience shows that, with large logical networks, even
  finer-grained partitioning of data pays off in scalability, e.g.
  designing logical flows so that only the chassis hosting a logical port
  needs the flows related to it. (This is not necessary initially, but it
  is worth bearing in mind in the design.)

  One may view LN data in at least two different ways. In one view, it is
  an ordinary database that must support all the traditional transactional
  operations that databases ordinarily provide. From another viewpoint, the
  LN is a slave of the cloud management system running northbound of OVN.
  That CMS determines the entire OVN logical configuration, so the LN's
  content at any given time is a deterministic function of the CMS's
  configuration. From that viewpoint, it might suffice for a single master
  (the CMS) to provide atomic changes to the LN. Even durability may not be
  important, since the CMS can always provide a replacement snapshot.

  LN data is likely to change more quickly than PN data. This is especially
  true in a container environment, where containers are created and
  destroyed (and therefore added to and deleted from logical switches)
  quickly.

* Bindings: the current placement of logical components (such as VMs and
  vifs) onto chassis, and the bindings between logical ports and MACs.
  Bindings change frequently, at least every time a VM powers up or down or
  migrates, and especially quickly in a container environment. The amount
  of data per VM (or vif) is small.

  Each chassis is authoritative about the VMs and vifs that it hosts at any
  given time and can efficiently flood that state to a central location, so
  the consistency needs are minimal.

           +----------------------------------------+
           |        Cloud Management System         |
           +----------------------------------------+
               |               |
               |               |
+------------------+  +------------------+  +------------------+
| Physical Network |  | Logical Network  |  |     Bindings     |
|       (PN)       |  |       (LN)       |  |                  |
+------------------+  +------------------+  +------------------+
         |                     |                     |
         +---------+-----------+-------------+-------+
                   |                         |
           +----------------+        +----------------+
           |  Hypervisor 1  |        |  Hypervisor 2  |
           +----------------+        +----------------+

An important design decision is the choice of database. It is also possible
to choose multiple databases, dividing the data according to its different
uses as described above. Some factors that bear on the choice of database:

* Availability. Clustering could be helpful, but the ability to
  resynchronize cheaply from a rebooted database server (e.g. using the
  "Difference Digests" described in Epstein et al., "What's the Difference?
  Efficient Set Reconciliation Without Prior Context") might be just as
  important.

* The Bindings database, and to a lesser extent the LN database, should
  support a high write rate.

* The database should scale to a large number of connections (thousands,
  for a large OVN deployment).

* The database should have C bindings.

We initially plan to use OVSDB as the OVN database. ovsdb-server does not
yet include clustering, nor does it have cheap resynchronization, nor does
it scale to thousands of connections. None of these limitations is
fundamental to its design, so as bottlenecks arise we will add the
necessary features as part of OVN development. (None of these features
should cause backward incompatibility with existing OVSDB clients.) If this
proves impracticable, we will switch to an alternative database. The
interfaces and the partitioning of the system state are more important to
get right; the implementations behind the interfaces are then simple to
change.

Cloud Management System
-----------------------

OVN requires integration with the cloud management system in use. We will
write a plugin to integrate OVN into OpenStack. The primary job of the
plugin is to translate the CMS configuration, which forms the northbound
API, into logical datapath flows in the OVN LN database. The CMS plugin may
also update the PN.

A significant amount of the code that translates the CMS configuration into
logical datapath flows may be independent of the CMS in use. It should be
possible to reuse this code from one CMS plugin to another.
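As a rough illustration of the plugin's output (the rule, the table number,
and the way the row is written out below are all hypothetical), a CMS
security-group rule such as "allow inbound TCP to port 22" might be
translated into a row of the Pipeline table described under Logical
Datapath Flows below, along these lines:

    table_id=2, priority=100, match="tcp.dst == 22", actions="resubmit"

Since a packet that matches no flow is dropped by default, a default-deny
policy needs no additional flow, while a default-allow policy would add a
low-priority catch-all flow whose actions are "resubmit" (see ACLs under
Implementing Features). Rendering rules like this into logical flows is the
part of the plugin that should be reusable across CMSes.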
Logical Pipeline
================

In OVN, packet processing follows these steps:

* Physical ingress, from a VIF, a tunnel, or a gateway physical port.

* Logical ingress. OVN identifies the packet's logical datapath and logical
  port.

* Logical datapath processing. The packet passes through each of the stages
  in the ingress logical datapath. In the end, the logical datapath flows
  output the packet to zero or more logical egress ports.

* Further logical datapath processing. If an egress logical port connects
  to another logical datapath, then the packet passes through that logical
  datapath in the same way as the initial one. In this way, a set of
  logical datapaths can be connected into a logical topology, e.g. one that
  represents a network of interconnected logical routers and switches.

* Logical egress. Eventually, a packet that is not dropped is output to a
  logical port that has a physical realization. OVN identifies how to send
  the packet to its physical egress.

* Physical egress, to a VIF, a tunnel, or a gateway physical port.

The pipeline processing is split between the ingress and egress transport
nodes. In particular, logical egress processing may occur at either
hypervisor. Processing logical egress on the ingress hypervisor requires
more state about the egress vif's policies, but reduces traffic on the wire
that would eventually be dropped. Processing on the egress hypervisor, on
the other hand, can reduce broadcast traffic on the wire, because
replication can be done locally on the egress side. We initially plan to
process logical egress on the egress hypervisor, so that less state needs
to be replicated. However, we may change this behavior once we gain
experience writing the logical flows.

Logical Datapath Flows
----------------------

The LN database specifies the logical topology as a set of logical datapath
flows (as computed by OVN's CMS plugin). A logical datapath flow is much
like an OpenFlow flow, except that it is written in terms of logical ports
and logical datapaths instead of physical ports and physical datapaths.
ovn-controller translates logical flows into physical flows. The
translation process helps to ensure isolation between logical datapaths.

The Pipeline table in the LN database stores the logical datapath flows. It
has the following columns:

* table_id: an integer that designates a stage in the logical pipeline,
  analogous to an OpenFlow table number.

* priority: an integer between 0 and 65535 that designates the flow's
  priority. Flows with numerically higher priority take precedence over
  those with lower priority. If two logical datapath flows with the same
  priority both match, then the one actually applied to the packet is
  undefined.

* match: a string specifying a matching expression (see below) that
  determines which packets the flow matches.

* actions: a string specifying a sequence of actions (see below) to execute
  when the matching expression is satisfied.

The default action when no flow matches is to drop packets.

Matching Expressions
--------------------

Matching expressions provide a superset of OpenFlow's matching capabilities
for packets in a logical datapath. Expressions use a syntax similar to
Boolean expressions in a programming language.

Matching expressions have two kinds of primaries: fields and constants. A
field names a piece of packet data or metadata. The supported fields are:

    metadata  reg0 ... reg7  xreg0 ... xreg3
    inport  outport  queue
    eth.src  eth.dst  eth.type
    vlan.tci  vlan.vid  vlan.pcp  vlan.present
    ip.proto  ip.dscp  ip.ecn  ip.ttl  ip.frag
    ip4.src  ip4.dst
    ip6.src  ip6.dst  ip6.label
    arp.op  arp.spa  arp.tpa  arp.sha  arp.tha
    tcp.src  tcp.dst  tcp.flags
    udp.src  udp.dst
    sctp.src  sctp.dst
    icmp4.type  icmp4.code
    icmp6.type  icmp6.code
    nd.target  nd.sll  nd.tll

Subfields may be addressed using a [] suffix, e.g. tcp.src[0..7] refers to
the low 8 bits of the TCP source port. A subfield may be used in any
context where a field is allowed.

Some fields have prerequisites, which OVN satisfies by implicitly adding
clauses. For example, "arp.op == 1" is equivalent to "eth.type == 0x0806 &&
arp.op == 1", and "tcp.src == 80" is equivalent to "(eth.type == 0x0800 ||
eth.type == 0x86dd) && ip.proto == 6 && tcp.src == 80".
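As an illustration of how fields, subfields, and prerequisites combine (the
address and the policy here are invented, and, as noted below, early
versions may implement only a subset of this language), the following
expression matches TCP SYN packets sent to port 80 by one particular IPv4
host; bit 1 of tcp.flags is the SYN flag, and the eth.type and ip.proto
prerequisite clauses are implied:

    ip4.src == 192.0.2.7 && tcp.dst == 80 && tcp.flags[1] == 1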
Constants may be expressed in several forms: decimal integers, hexadecimal
integers prefixed by 0x, dotted-quad IPv4 addresses, IPv6 addresses in any
of their standard forms, and Ethernet addresses as colon-separated hex
digits. A constant in any of these forms may be followed by a slash and a
second constant (the mask) in the same form, to form a masked constant.
IPv4 and IPv6 masks may be given as integers, to express CIDR prefixes.

The available operators, from highest to lowest precedence, are:

    ()
    ==   !=   <   <=   >   >=   in   not in
    !
    &&
    ||

The () operator is used for grouping.

The equality operator == is the most important operator. Its operands must
be a field and an optionally masked constant, in either order. The ==
operator yields true when the field's value equals the constant's value for
all the bits included in the mask. The == operator translates simply and
naturally to OpenFlow.

The inequality operator != yields the inverse of ==, but its syntax and use
are the same. Implementation of the inequality operator is expensive.

The relational operators are <, <=, >, and >=. Their operands must be a
field and a constant, in either order; the constant must not be masked.
These operators are most commonly useful for L4 ports, e.g.
"tcp.src < 1024". Implementation of the relational operators is expensive.

The set membership operator "in", with syntax "<field> in { <constant1>,
<constant2>, ... }", is syntactic sugar for "(<field> == <constant1> ||
<field> == <constant2> || ...)". Conversely, "<field> not in { <constant1>,
<constant2>, ... }" is syntactic sugar for "(<field> != <constant1> &&
<field> != <constant2> && ...)".

The unary prefix operator ! yields its operand's logical inverse.

The logical AND operator && yields true only if both of its operands are
true.

The logical OR operator || yields true if at least one of its operands is
true.

(The above is pretty ambitious. It probably makes sense to initially
implement only a subset of this specification. The full specification is
written out mainly to get an idea of what a fully general matching
expression language could include.)

Actions
-------

Below, a <value> is either a <constant> or a <field>. The following actions
seem most likely to be useful:

    drop                  syntactic sugar for no actions
    output(<value>)       output to a port
    broadcast             output to every logical port except the ingress
                          port
    resubmit              execute the next logical datapath table as a
                          subroutine
    set(<field>=<value>)  set a data or metadata field, or copy between
                          fields

The following are not yet well thought out:

    learn
    conntrack
    with(<field>=<value>) { <action>, ... }
                          execute the actions with temporary changes to
                          fields
    dec_ttl { <action>, ... } { <action>, ... }
                          decrement the TTL; execute the first set of
                          actions if successful, the second set if the TTL
                          decrement fails
    icmp_reply { <action>, ... }
                          generate an ICMP reply from the packet, then
                          execute the <action>s

Other actions can be added as needed (e.g. push_vlan, pop_vlan, push_mpls,
pop_mpls, ...).

Some of the OVN actions do not map directly to OpenFlow actions, e.g.:

* with: implemented as "stack_push", "set", <actions>, "stack_pop".

* dec_ttl: implemented as dec_ttl followed by the success actions. The
  failure case has to be handled by ovn-controller interpreting packet-in
  messages; it might be difficult for ovn-controller to identify the
  particular place in the processing pipeline where the failure occurred,
  so some restrictions may be necessary.

* icmp_reply: implemented by sending the packet to ovn-controller, which
  generates the ICMP reply and sends the packet back to ovs-vswitchd.
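As a sketch of the kind of translation involved for "with" (the logical
action, the register, and the OpenFlow table number below are hypothetical,
and the encoding ovn-controller actually uses may differ), a logical action
such as

    with(reg0=0x5) { resubmit }

might be rendered with the OVS "stack" and "load" OpenFlow extension
actions, in ovs-ofctl syntax, roughly as:

    push:NXM_NX_REG0[],load:0x5->NXM_NX_REG0[],
        resubmit(,<next-table>),pop:NXM_NX_REG0[]

That is, the original value of reg0 is saved on the stack, temporarily
overwritten for the nested actions, and then restored, matching the
"stack_push", "set", <actions>, "stack_pop" outline above.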
Implementing Features
=====================

Each OVN logical network feature is implemented as a table of logical
datapath flows, and these tables are arranged into a pipeline. The designs
are not fully fleshed out, but here are some examples.

Ingress Admissibility Check
---------------------------

Some invariants of valid packets can be checked at ingress into the
pipeline, e.g.:

* Discard packets with a multicast source address:

      eth.src[40] == 1

* Discard packets with a malformed VLAN header:

      eth.type == 0x8100 && !vlan.present

* Discard BPDUs:

      eth.dst == 01:80:c2:00:00:00/ff:ff:ff:ff:ff:f0

* We do not plan to implement logical switch VLANs in the first version of
  OVN, so drop VLAN-tagged packets:

      vlan.present

A low-priority flow resubmits all other packets to the next pipeline stage.

ACLs
----

Logical datapath flows for ACLs correspond closely to the ACLs themselves:
"deny" ACLs drop packets, "allow" ACLs resubmit to the next pipeline stage,
and the default drop or allow policy is expressed as a low-priority flow
that drops or resubmits.

L2 Switching
------------

Many logical L2 switches do not need to do MAC learning, because the MAC
addresses of all of the VMs or logical routers on the switch are known. The
flows required to process packets in this case are very simple. For each
known (<mac>, <logical-port>) pair:

    eth.dst == <mac>, actions=set(reg0=<logical-port>), resubmit

Multicast and broadcast packets are handled by repeating the actions above
for every logical port (the "broadcast" action may be useful in some
cases):

    eth.dst[40] == 1, actions=set(reg0=<logical-port-1>), resubmit,
                              set(reg0=<logical-port-2>), resubmit, ...

The above assumes that reg0 designates the logical output port, but the
particular register assignment does not matter as long as the logical
datapath flows use it consistently.

OpenFlow by default prevents a packet received on a particular OpenFlow
port from being output back to that same port. We will want the same
behavior for logical ports in logical datapath switching; it could be
implemented either in the definition of the logical datapath "output" and
"broadcast" actions or in the logical datapath flows themselves.
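To tie the sections above together, here is a sketch of the L2 switching
flows for a hypothetical logical switch with two known ports (the MAC
addresses and logical port numbers are invented, and reg0 is assumed to
carry the logical output port, as above):

    eth.dst == 00:00:00:00:00:01, actions=set(reg0=1), resubmit
    eth.dst == 00:00:00:00:00:02, actions=set(reg0=2), resubmit
    eth.dst[40] == 1, actions=set(reg0=1), resubmit, set(reg0=2), resubmit

ovn-controller on a chassis might render the first of these as an OpenFlow
flow along the following lines (ovs-ofctl syntax; the OpenFlow table
numbers and the use of the physical reg0 register are only illustrative,
and a real translation would also have to match on the packet's logical
datapath, e.g. in the OpenFlow metadata field, to keep datapaths isolated):

    table=5, priority=100, dl_dst=00:00:00:00:00:01,
        actions=load:0x1->NXM_NX_REG0[],resubmit(,6)

As noted above, output back to the ingress logical port would still have to
be suppressed, either in the definition of the "output" and "broadcast"
actions or in the flows themselves.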