Looks great. Thanks. --Justin
On Jan 9, 2013, at 2:10 PM, Ben Pfaff <b...@nicira.com> wrote:

> On Tue, Jan 08, 2013 at 06:16:58PM -0800, Justin Pettit wrote:
>>> + * - A "flow", that is, a summary of the headers in a Ethernet packet. The
>>
>> s/a/an/
>
> Fixed.
>
>> This sort of sounds like only the Ethernet header fields make up the
>> flow. Maybe "L2/L3/L4 headers" or something like that?
>
> I explain that in the third paragraph:
> * Flows are fine-grained entities that include L2, L3, and L4 headers. A
> * single TCP connection consists of two flows, one in each direction.
> I agree that it's good to get an example early, so I merged that
> paragraph into this one.
>
>>> + * (In case you are familiar with OpenFlow, datapath flows are analogous
>>> + * to OpenFlow flow matches. The most important difference is that
>>> + * OpenFlow allows fields to be wildcarded, whereas a datapath's flow
>>> + * table is a hash table so every flow must be exact-match.)
>>
>> I might add "and prioritized" after "wildcarded", since this often
>> seems to trip people up in understanding the datapath flow table.
>
> Done, thanks. (I've never quite understood how they think a hash
> table should be prioritized, but whatever.)
>
>>> + * The actions list may be empty. This indicates that nothing should be
>>> + * done to matching packet, e.g. they should be dropped.
>>
>> s/packet/packets/
>
> Done.
>
>> Is this an "e.g." or an "i.e."? Isn't the packet always going to be dropped?
>
> "i.e."
>
> I changed it to "that is".
>
>>> + * An upcall contains an entire packet. There is no attempt to, e.g., copy
>>> + * only as much of the packet as normally needed to make a forwarding decision.
>>> + * Such an optimization is doable, but experimental prototypes showed it to be
>>> + * of little benefit because an upcall typically contains the first packet of a
>>> + * flow, which is usually short (e.g. a TCP SYN).
>>
>> I'm not sure we want to only use this justification, since we also
>> use the packet for things like packet sampling and deeper inspection
>> for in-band.
>
> OK, I added another sentence.
>
>>> + * The datapath should ensure that that a high rate of upcalls from one
>>
>> There are two "that"s.
>
> They were on sale.
>
>>> + * The client has some control over "action" upcalls: it can specify a 32-bit
>>> + * "Netlink PID" as part of the action. This terminology comes from the Linux
>>> + * datapath implementation, which uses a protocol called Netlink in which a PID
>>> + * designates a particular socket and the upcall data is delivered to the
>>> + * socket's received queue. Generically, though, a Netlink PID identifies a
>>> + * queue for upcalls. The basic requirements on the datapath are:
>>
>> Is it a "received queue" or a "receive queue"? I always thought it
>> was the latter (i.e., no "d").
>
> "receive queue". Fixed.
>
>>> + * - The datapath must provide a Netlink PID associated with each port. The
>>> + * client can retrieve the PID with dpif_port_get_pid().
>>> + *
>>> + * - The datapath must provide an additional Netlink PID, not associated
>>> + * with any port. dpif_port_get_pid() also provides this PID.
>>
>> I think it would be nice to explain why this other PID is needed
>> (and possibly explain that the value is UINT32_MAX).
>
> I added a note:
>
> * (ovs-vswitchd uses this additional PID to queue "special" packets that
> * must not be lost even if a port is otherwise busy, such as packets used
> * for tunnel monitoring.)
>
> The special PID value isn't UINT32_MAX, that's just the
> dpif_port_get_pid() argument used to obtain the special PID. The reader
> should be able to find that out from reading the details of the
> interface; I don't see a need to say it here too.
>
>>> + * - Upcalls that specify the additional Netlink PID are queued separately.
>>
>> Calling this the "additional Netlink PID" seems insufficiently
>> specific. What about calling it something like the "system Netlink
>> PID" here and where it was introduced earlier?
>
> I ended up calling it the "special" Netlink PID, hope that's OK.
>
>>> + * For each upcall received, the client examines the enclosed packet and
>>> + * figures out what should be done with it. For example, if the client
>>> + * implements a MAC-learning switch, then it searches the forwarding database
>>> + * for the packet's destination MAC and VLAN and determines the set of ports to
>>> + * which it should be sent. In any case, the client composes a set of datapath
>>> + * actions to properly dispatch the packet and then directs the datapath to
>>> + * execute those actions on the packet (e.g. with dpif_execute()).
>>
>> Is it an "e.g." or an "i.e."?
>
> Other functions can do this. ofproto-dpif actually uses
> dpif_operate(), I think.
>
> Here's an incremental diff and then the revised patch. Any further
> comments?
>
> diff --git a/lib/dpif.h b/lib/dpif.h
> index 9b45850..a478db2 100644
> --- a/lib/dpif.h
> +++ b/lib/dpif.h
> @@ -1,5 +1,5 @@
> /*
> - * Copyright (c) 2008, 2009, 2010, 2011, 2012 Nicira, Inc.
> + * Copyright (c) 2008, 2009, 2010, 2011, 2012, 2013 Nicira, Inc.
> *
> * Licensed under the Apache License, Version 2.0 (the "License");
> * you may not use this file except in compliance with the License.
> @@ -105,8 +105,10 @@
> *
> * The flow table is a hash table of "flow entries". Each flow entry contains:
> *
> - * - A "flow", that is, a summary of the headers in a Ethernet packet. The
> + * - A "flow", that is, a summary of the headers in an Ethernet packet. The
> * flow is the hash key and thus must be unique within the flow table.
> + * Flows are fine-grained entities that include L2, L3, and L4 headers. A
> + * single TCP connection consists of two flows, one in each direction.
> *
> * In Open vSwitch userspace, "struct flow" is the typical way to describe
> * a flow, but the datapath interface uses a different data format to
> @@ -115,13 +117,11 @@
> * "struct ovs_key_*" in include/linux/openvswitch.h for details.
> * lib/odp-util.h defines several functions for working with these flows.
> *
> - * Flows are fine-grained entities that include L2, L3, and L4 headers. A
> - * single TCP connection consists of two flows, one in each direction.
> - *
> * (In case you are familiar with OpenFlow, datapath flows are analogous
> * to OpenFlow flow matches. The most important difference is that
> - * OpenFlow allows fields to be wildcarded, whereas a datapath's flow
> - * table is a hash table so every flow must be exact-match.)
> + * OpenFlow allows fields to be wildcarded and prioritized, whereas a
> + * datapath's flow table is a hash table so every flow must be
> + * exact-match, thus without priorities.)
> *
> * - A list of "actions" that tell the datapath what to do with packets
> * within a flow. Some examples of actions are OVS_ACTION_ATTR_OUTPUT,
> @@ -132,7 +132,7 @@
> * actions.
> *
> * The actions list may be empty. This indicates that nothing should be
> - * done to matching packet, e.g. they should be dropped.
> + * done to matching packets, that is, they should be dropped.
> *
> * (In case you are familiar with OpenFlow, datapath actions are analogous
> * to OpenFlow actions.)
> @@ -165,7 +165,8 @@
> * only as much of the packet as normally needed to make a forwarding decision.
> * Such an optimization is doable, but experimental prototypes showed it to be
> * of little benefit because an upcall typically contains the first packet of a
> - * flow, which is usually short (e.g. a TCP SYN).
> + * flow, which is usually short (e.g. a TCP SYN). Also, the entire packet can
> + * sometimes really be needed.
> *
> * After a client reads a given upcall, the datapath is finished with it, that
> * is, the datapath doesn't maintain any lingering state past that point.
> @@ -197,12 +198,12 @@
> * implementation, is that all upcalls are appended to a single queue, which is
> * delivered to the client in order.
> *
> - * The datapath should ensure that that a high rate of upcalls from one
> - * particular port cannot cause upcalls from other sources to be dropped or
> - * unreasonably delayed. Otherwise, one port conducting a port scan or
> - * otherwise initiating high-rate traffic spanning many flows could suppress
> - * other traffic. Ideally, the datapath should present upcalls from each port
> - * in a "round robin" manner, to ensure fairness.
> + * The datapath should ensure that a high rate of upcalls from one particular
> + * port cannot cause upcalls from other sources to be dropped or unreasonably
> + * delayed. Otherwise, one port conducting a port scan or otherwise initiating
> + * high-rate traffic spanning many flows could suppress other traffic.
> + * Ideally, the datapath should present upcalls from each port in a "round
> + * robin" manner, to ensure fairness.
> *
> * The client has no control over "miss" upcalls and no insight into the
> * datapath's implementation, so the datapath is entirely responsible for
> @@ -219,14 +220,16 @@
> * "Netlink PID" as part of the action. This terminology comes from the Linux
> * datapath implementation, which uses a protocol called Netlink in which a PID
> * designates a particular socket and the upcall data is delivered to the
> - * socket's received queue. Generically, though, a Netlink PID identifies a
> + * socket's receive queue. Generically, though, a Netlink PID identifies a
> * queue for upcalls. The basic requirements on the datapath are:
> *
> * - The datapath must provide a Netlink PID associated with each port. The
> * client can retrieve the PID with dpif_port_get_pid().
> *
> - * - The datapath must provide an additional Netlink PID, not associated
> - * with any port. dpif_port_get_pid() also provides this PID.
> + * - The datapath must provide a "special" Netlink PID not associated with
> + * any port. dpif_port_get_pid() also provides this PID. (ovs-vswitchd
> + * uses this PID to queue special packets that must not be lost even if a
> + * port is otherwise busy, such as packets used for tunnel monitoring.)
> *
> * The minimal behavior of dpif_port_get_pid() and the treatment of the Netlink
> * PID in "action" upcalls is that dpif_port_get_pid() returns a constant value
> @@ -244,7 +247,7 @@
> * were received, regardless of whether the upcalls are "miss" or "action"
> * upcalls.
> *
> - * - Upcalls that specify the additional Netlink PID are queued separately.
> + * - Upcalls that specify the "special" Netlink PID are queued separately.
> *
> *
> * Packet Format
>
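As a quick illustration of the PID-retrieval point settled in this exchange, here is a sketch rather than verbatim OVS code (the exact prototype is in lib/dpif.h): per the explanation above, UINT32_MAX is not the special PID's value, only the port-number argument that selects it.

    /* Sketch only; see lib/dpif.h for the real prototype.  UINT32_MAX is
     * merely the argument that asks for the "special" PID -- the PID value
     * actually returned is up to the datapath. */
    uint32_t port_pid = dpif_port_get_pid(dpif, port_no);       /* Per-port PID. */
    uint32_t special_pid = dpif_port_get_pid(dpif, UINT32_MAX); /* "Special" PID. */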
> --8<--------------------------cut here-------------------------->8--
>
> From: Ben Pfaff <b...@nicira.com>
> Date: Wed, 9 Jan 2013 14:10:46 -0800
> Subject: [PATCH] dpif: Document.
>
> Signed-off-by: Ben Pfaff <b...@nicira.com>
> ---
> lib/dpif.h | 307 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 files changed, 305 insertions(+), 2 deletions(-)
>
> diff --git a/lib/dpif.h b/lib/dpif.h
> index 893338b..a478db2 100644
> --- a/lib/dpif.h
> +++ b/lib/dpif.h
> @@ -1,5 +1,5 @@
> /*
> - * Copyright (c) 2008, 2009, 2010, 2011, 2012 Nicira, Inc.
> + * Copyright (c) 2008, 2009, 2010, 2011, 2012, 2013 Nicira, Inc.
> *
> * Licensed under the Apache License, Version 2.0 (the "License");
> * you may not use this file except in compliance with the License.
> @@ -14,7 +14,310 @@
> * limitations under the License.
> */
>
> -
> +/*
> + * dpif, the DataPath InterFace.
> + *
> + * In Open vSwitch terminology, a "datapath" is a flow-based software switch.
> + * A datapath has no intelligence of its own. Rather, it relies entirely on
> + * its client to set up flows. The datapath layer is core to the Open vSwitch
> + * software switch: one could say, without much exaggeration, that everything
> + * in ovs-vswitchd above dpif exists only to make the correct decisions
> + * interacting with dpif.
> + *
> + * Typically, the client of a datapath is the software switch module in
> + * "ovs-vswitchd", but other clients can be written. The "ovs-dpctl" utility
> + * is also a (simple) client.
> + *
> + *
> + * Overview
> + * ========
> + *
> + * The terms written in quotes below are defined in later sections.
> + *
> + * When a datapath "port" receives a packet, it extracts the headers (the
> + * "flow"). If the datapath's "flow table" contains a "flow entry" whose flow
> + * is the same as the packet's, then it executes the "actions" in the flow
> + * entry and increments the flow's statistics. If there is no matching flow
> + * entry, the datapath instead appends the packet to an "upcall" queue.
> + *
> + *
> + * Ports
> + * =====
> + *
> + * A datapath has a set of ports that are analogous to the ports on an Ethernet
> + * switch. At the datapath level, each port has the following information
> + * associated with it:
> + *
> + * - A name, a short string that must be unique within the host. This is
> + * typically a name that would be familiar to the system administrator,
> + * e.g. "eth0" or "vif1.1", but it is otherwise arbitrary.
> + *
> + * - A 32-bit port number that must be unique within the datapath but is
> + * otherwise arbitrary. The port number is the most important identifier
> + * for a port in the datapath interface.
> + *
> + * - A type, a short string that identifies the kind of port. On a Linux
> + * host, typical types are "system" (for a network device such as eth0),
> + * "internal" (for a simulated port used to connect to the TCP/IP stack),
> + * and "gre" (for a GRE tunnel).
> + *
> + * - A Netlink PID (see "Upcall Queuing and Ordering" below).
> + *
> + * The dpif interface has functions for adding and deleting ports. When a
> + * datapath implements these (e.g. as the Linux and netdev datapaths do), then
> + * Open vSwitch's ovs-vswitchd daemon can directly control what ports are used
> + * for switching. Some datapaths might not implement them, or implement them
> + * with restrictions on the types of ports that can be added or removed
> + * (e.g. on ESX), on systems where port membership can only be changed by some
> + * external entity.
> + *
> + * Each datapath must have a port, sometimes called the "local port", whose
> + * name is the same as the datapath itself, with port number 0. The local port
> + * cannot be deleted.
> + *
> + * Ports are available as "struct netdev"s. To obtain a "struct netdev *" for
> + * a port named 'name' with type 'port_type', in a datapath of type
> + * 'datapath_type', call netdev_open(name, dpif_port_open_type(datapath_type,
> + * port_type)). The netdev can be used to get and set important data related to
> + * the port, such as:
> + *
> + * - MTU (netdev_get_mtu(), netdev_set_mtu()).
> + *
> + * - Ethernet address (netdev_get_etheraddr(), netdev_set_etheraddr()).
> + *
> + * - Statistics such as the number of packets and bytes transmitted and
> + * received (netdev_get_stats()).
> + *
> + * - Carrier status (netdev_get_carrier()).
> + *
> + * - Speed (netdev_get_features()).
> + *
> + * - QoS queue configuration (netdev_get_queue(), netdev_set_queue() and
> + * related functions.)
> + *
> + * - Arbitrary port-specific configuration parameters (netdev_get_config(),
> + * netdev_set_config()). An example of such a parameter is the IP
> + * endpoint for a GRE tunnel.
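To make the netdev_open() recipe above concrete, here is a sketch of fetching a port's netdev and reading its MTU. It assumes the error-code-returning signatures conventional in this tree (netdev_open() filling in a struct netdev ** on success); consult lib/netdev.h for the authoritative prototypes.

    /* Sketch, with assumed signatures; see lib/netdev.h for the real ones. */
    struct netdev *netdev;
    int error = netdev_open(name,
                            dpif_port_open_type(datapath_type, port_type),
                            &netdev);
    if (!error) {
        int mtu;
        if (!netdev_get_mtu(netdev, &mtu)) {
            /* 'mtu' now holds the port's MTU. */
        }
        netdev_close(netdev);
    }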
> + *
> + *
> + * Flow Table
> + * ==========
> + *
> + * The flow table is a hash table of "flow entries". Each flow entry contains:
> + *
> + * - A "flow", that is, a summary of the headers in an Ethernet packet. The
> + * flow is the hash key and thus must be unique within the flow table.
> + * Flows are fine-grained entities that include L2, L3, and L4 headers. A
> + * single TCP connection consists of two flows, one in each direction.
> + *
> + * In Open vSwitch userspace, "struct flow" is the typical way to describe
> + * a flow, but the datapath interface uses a different data format to
> + * allow ABI forward- and backward-compatibility. datapath/README
> + * describes the rationale and design. Refer to OVS_KEY_ATTR_* and
> + * "struct ovs_key_*" in include/linux/openvswitch.h for details.
> + * lib/odp-util.h defines several functions for working with these flows.
> + *
> + * (In case you are familiar with OpenFlow, datapath flows are analogous
> + * to OpenFlow flow matches. The most important difference is that
> + * OpenFlow allows fields to be wildcarded and prioritized, whereas a
> + * datapath's flow table is a hash table so every flow must be
> + * exact-match, thus without priorities.)
> + *
> + * - A list of "actions" that tell the datapath what to do with packets
> + * within a flow. Some examples of actions are OVS_ACTION_ATTR_OUTPUT,
> + * which transmits the packet out a port, and OVS_ACTION_ATTR_SET, which
> + * modifies packet headers. Refer to OVS_ACTION_ATTR_* and "struct
> + * ovs_action_*" in include/linux/openvswitch.h for details.
> + * lib/odp-util.h defines several functions for working with datapath
> + * actions.
> + *
> + * The actions list may be empty. This indicates that nothing should be
> + * done to matching packets, that is, they should be dropped.
> + *
> + * (In case you are familiar with OpenFlow, datapath actions are analogous
> + * to OpenFlow actions.)
> + *
> + * - Statistics: the number of packets and bytes that the flow has
> + * processed, the last time that the flow processed a packet, and the
> + * union of all the TCP flags in packets processed by the flow. (The
> + * latter is 0 if the flow is not a TCP flow.)
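The exact-match point in the first bullet is easiest to see in code. The toy table below is purely illustrative (none of these names are OVS identifiers): because lookup is a hash of the whole key followed by a byte-for-byte comparison, there is nowhere for a wildcard or a priority to participate.

    #include <stdint.h>
    #include <string.h>

    /* Toy flow table, for illustration only.  A real datapath keys on the
     * complete L2/L3/L4 header summary, not this abbreviated key. */
    struct toy_key {
        uint32_t in_port;
        uint8_t dl_dst[6];          /* Destination MAC. */
        uint16_t tp_dst;            /* L4 destination port. */
    };

    struct toy_entry {
        struct toy_key key;         /* Hash key: unique within the table. */
        /* actions, statistics, ... */
    };

    #define N_BUCKETS 1024
    static struct toy_entry *buckets[N_BUCKETS]; /* No collision chains, to
                                                  * keep the toy small. */

    static uint32_t toy_hash(const struct toy_key *key)
    {
        const uint8_t *p = (const uint8_t *) key;
        uint32_t hash = 2166136261u;             /* FNV-1a. */
        for (size_t i = 0; i < sizeof *key; i++) {
            hash = (hash ^ p[i]) * 16777619u;
        }
        return hash % N_BUCKETS;
    }

    /* Exact match or nothing: a miss becomes a DPIF_UC_MISS upcall. */
    static struct toy_entry *toy_lookup(const struct toy_key *key)
    {
        struct toy_entry *e = buckets[toy_hash(key)];
        return e && !memcmp(&e->key, key, sizeof *key) ? e : NULL;
    }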
> + *
> + * The datapath's client manages the flow table, primarily in reaction to
> + * "upcalls" (see below).
> + *
> + *
> + * Upcalls
> + * =======
> + *
> + * A datapath sometimes needs to notify its client that a packet was received.
> + * The datapath mechanism to do this is called an "upcall".
> + *
> + * Upcalls are used in two situations:
> + *
> + * - When a packet is received, but there is no matching flow entry in its
> + * flow table (a flow table "miss"), this causes an upcall of type
> + * DPIF_UC_MISS. These are called "miss" upcalls.
> + *
> + * - A datapath action of type OVS_ACTION_ATTR_USERSPACE causes an upcall of
> + * type DPIF_UC_ACTION. These are called "action" upcalls.
> + *
> + * An upcall contains an entire packet. There is no attempt to, e.g., copy
> + * only as much of the packet as normally needed to make a forwarding decision.
> + * Such an optimization is doable, but experimental prototypes showed it to be
> + * of little benefit because an upcall typically contains the first packet of a
> + * flow, which is usually short (e.g. a TCP SYN). Also, the entire packet can
> + * sometimes really be needed.
> + *
> + * After a client reads a given upcall, the datapath is finished with it, that
> + * is, the datapath doesn't maintain any lingering state past that point.
> + *
> + * The latency from the time that a packet arrives at a port to the time that
> + * it is received from dpif_recv() is critical in some benchmarks. For
> + * example, if this latency is 1 ms, then a netperf TCP_CRR test, which opens
> + * and closes TCP connections one at a time as quickly as it can, cannot
> + * possibly achieve more than 500 transactions per second, since every
> + * connection consists of two flows with 1-ms latency to set up each one.
> + *
> + * To receive upcalls, a client has to enable them with dpif_recv_set(). A
> + * datapath should generally support multiple clients at once (e.g. so that one
> + * may run "ovs-dpctl show" or "ovs-dpctl dump-flows" while "ovs-vswitchd" is
> + * also running) but need not support multiple clients enabling upcalls at
> + * once.
> + *
> + *
> + * Upcall Queuing and Ordering
> + * ---------------------------
> + *
> + * The datapath's client reads upcalls one at a time by calling dpif_recv().
> + * When more than one upcall is pending, the order in which the datapath
> + * presents upcalls to its client is important. The datapath's client does not
> + * directly control this order, so the datapath implementer must take care
> + * during design.
> + *
> + * The minimal behavior, suitable for initial testing of a datapath
> + * implementation, is that all upcalls are appended to a single queue, which is
> + * delivered to the client in order.
> + *
> + * The datapath should ensure that a high rate of upcalls from one particular
> + * port cannot cause upcalls from other sources to be dropped or unreasonably
> + * delayed. Otherwise, one port conducting a port scan or otherwise initiating
> + * high-rate traffic spanning many flows could suppress other traffic.
> + * Ideally, the datapath should present upcalls from each port in a "round
> + * robin" manner, to ensure fairness.
> + *
> + * The client has no control over "miss" upcalls and no insight into the
> + * datapath's implementation, so the datapath is entirely responsible for
> + * queuing and delivering them. On the other hand, the datapath has
> + * considerable freedom of implementation. One good approach is to maintain a
> + * separate queue for each port, to prevent any given port's upcalls from
> + * interfering with other ports' upcalls. If this is impractical, then another
> + * reasonable choice is to maintain some fixed number of queues and assign each
> + * port to one of them. Ports assigned to the same queue can then interfere
> + * with each other, but not with ports assigned to different queues. Other
> + * approaches are also possible.
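The per-port-queue suggestion above is simple to sketch. Again this is illustrative C, not datapath code: one FIFO per port, with the reader rotating its starting port so that a flood on one port cannot starve the others.

    #include <stddef.h>

    #define N_PORTS 4                     /* Fixed for the sake of the toy. */

    struct upcall_node {
        struct upcall_node *next;
        /* packet data, flow key, upcall type, ... */
    };

    static struct upcall_node *port_queue[N_PORTS]; /* One FIFO per port. */
    static int rr_start;                            /* Next port to scan first. */

    /* Dequeue the next upcall, visiting ports in rotation for fairness. */
    static struct upcall_node *next_upcall(void)
    {
        for (int i = 0; i < N_PORTS; i++) {
            int port = (rr_start + i) % N_PORTS;
            struct upcall_node *node = port_queue[port];
            if (node) {
                port_queue[port] = node->next;
                rr_start = (port + 1) % N_PORTS; /* Resume after this port. */
                return node;
            }
        }
        return NULL;                             /* Nothing pending. */
    }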
> + *
> + * The client has some control over "action" upcalls: it can specify a 32-bit
> + * "Netlink PID" as part of the action. This terminology comes from the Linux
> + * datapath implementation, which uses a protocol called Netlink in which a PID
> + * designates a particular socket and the upcall data is delivered to the
> + * socket's receive queue. Generically, though, a Netlink PID identifies a
> + * queue for upcalls. The basic requirements on the datapath are:
> + *
> + * - The datapath must provide a Netlink PID associated with each port. The
> + * client can retrieve the PID with dpif_port_get_pid().
> + *
> + * - The datapath must provide a "special" Netlink PID not associated with
> + * any port. dpif_port_get_pid() also provides this PID. (ovs-vswitchd
> + * uses this PID to queue special packets that must not be lost even if a
> + * port is otherwise busy, such as packets used for tunnel monitoring.)
> + *
> + * The minimal behavior of dpif_port_get_pid() and the treatment of the Netlink
> + * PID in "action" upcalls is that dpif_port_get_pid() returns a constant value
> + * and all upcalls are appended to a single queue.
> + *
> + * The ideal behavior is:
> + *
> + * - Each port has a PID that identifies the queue used for "miss" upcalls
> + * on that port. (Thus, if each port has its own queue for "miss"
> + * upcalls, then each port has a different Netlink PID.)
> + *
> + * - "miss" upcalls for a given port and "action" upcalls that specify that
> + * port's Netlink PID add their upcalls to the same queue. The upcalls
> + * are delivered to the datapath's client in the order that the packets
> + * were received, regardless of whether the upcalls are "miss" or "action"
> + * upcalls.
> + *
> + * - Upcalls that specify the "special" Netlink PID are queued separately.
> + *
> + *
> + * Packet Format
> + * =============
> + *
> + * The datapath interface works with packets in a particular form. This is the
> + * form taken by packets received via upcalls (i.e. by dpif_recv()). Packets
> + * supplied to the datapath for processing (i.e. to dpif_execute()) also take
> + * this form.
> + *
> + * A VLAN tag is represented by an 802.1Q header. If the layer below the
> + * datapath interface uses another representation, then the datapath interface
> + * must perform conversion.
> + *
> + * The datapath interface requires all packets to fit within the MTU. Some
> + * operating systems internally process packets larger than MTU, with features
> + * such as TSO and UFO. When such a packet passes through the datapath
> + * interface, it must be broken into multiple MTU or smaller sized packets for
> + * presentation as upcalls. (This does not happen often, because an upcall
> + * typically contains the first packet of a flow, which is usually short.)
> + *
> + * Some operating system TCP/IP stacks maintain packets in an unchecksummed or
> + * partially checksummed state until transmission. The datapath interface
> + * requires all host-generated packets to be fully checksummed (e.g. IP and TCP
> + * checksums must be correct). On such an OS, the datapath interface must fill
> + * in these checksums.
> + *
> + * Packets passed through the datapath interface must be at least 14 bytes
> + * long, that is, they must have a complete Ethernet header. They are not
> + * required to be padded to the minimum Ethernet length.
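The packet-form rules above boil down to a couple of checks. These helpers are illustrative only; the constants are standard Ethernet values rather than OVS identifiers.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define ETH_HEADER_LEN 14       /* dst MAC (6) + src MAC (6) + type (2). */
    #define ETH_TYPE_VLAN 0x8100    /* 802.1Q TPID. */

    /* A packet must carry a complete Ethernet header, but need not be
     * padded out to the 60-byte minimum Ethernet frame length. */
    static bool packet_form_ok(size_t len)
    {
        return len >= ETH_HEADER_LEN;
    }

    /* In this form, a VLAN tag always appears as an 802.1Q header in the
     * type field's position, bytes 12 and 13. */
    static bool has_vlan_tag(const uint8_t *data, size_t len)
    {
        return len >= ETH_HEADER_LEN + 4
               && ((data[12] << 8) | data[13]) == ETH_TYPE_VLAN;
    }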
> + *
> + *
> + * Typical Usage
> + * =============
> + *
> + * Typically, the client of a datapath begins by configuring the datapath with
> + * a set of ports. Afterward, the client runs in a loop polling for upcalls to
> + * arrive.
> + *
> + * For each upcall received, the client examines the enclosed packet and
> + * figures out what should be done with it. For example, if the client
> + * implements a MAC-learning switch, then it searches the forwarding database
> + * for the packet's destination MAC and VLAN and determines the set of ports to
> + * which it should be sent. In any case, the client composes a set of datapath
> + * actions to properly dispatch the packet and then directs the datapath to
> + * execute those actions on the packet (e.g. with dpif_execute()).
> + *
> + * Most of the time, the actions that the client executed on the packet apply
> + * to every packet with the same flow. For example, the flow includes both
> + * destination MAC and VLAN ID (and much more), so this is true for the
> + * MAC-learning switch example above. In such a case, the client can also
> + * direct the datapath to treat any further packets in the flow in the same
> + * way, using dpif_flow_put() to add a new flow entry.
> + *
> + * Other tasks the client might need to perform, in addition to reacting to
> + * upcalls, include:
> + *
> + * - Periodically polling flow statistics, perhaps to supply to its own
> + * clients.
> + *
> + * - Deleting flow entries from the datapath that haven't been used
> + * recently, to save memory.
> + *
> + * - Updating flow entries whose actions should change. For example, if a
> + * MAC learning switch learns that a MAC has moved, then it must update
> + * the actions of flow entries that sent packets to the MAC at its old
> + * location.
> + *
> + * - Adding and removing ports to achieve a new configuration.
> + */
> #ifndef DPIF_H
> #define DPIF_H 1
>
> --
> 1.7.2.5
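To close, here is the "Typical Usage" loop rendered as a sketch. The control flow follows the text above, but the dpif_recv(), dpif_execute(), and dpif_flow_put() signatures are simplified stand-ins for the real prototypes in lib/dpif.h, and compose_actions() is a hypothetical placeholder for the client's forwarding logic (e.g. MAC learning).

    /* Sketch of the upcall-driven client loop; signatures simplified, not
     * the real prototypes.  compose_actions() is hypothetical. */
    for (;;) {
        struct upcall upcall;                /* Simplified upcall record. */

        if (!dpif_recv(dpif, &upcall)) {     /* Assume 0 means "got one". */
            /* Decide what to do with the packet (e.g. MAC learning). */
            struct actions acts = compose_actions(&upcall);

            /* Dispatch this packet now... */
            dpif_execute(dpif, upcall.packet, &acts);

            /* ...and, if the same actions apply to the whole flow, install
             * a flow entry so later packets bypass the upcall path. */
            dpif_flow_put(dpif, upcall.key, &acts);
        }
    }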