Yesterday, we held a very productive OVS meeting as part of the Linux Plumbers Conference. Below are the notes taken to record the meeting. Thanks to all who participated!
============================================================
OVS Micro Summit 2014, Oct 15, 2014, Düsseldorf

Attendees: Johann Tönsing (Netronome), Simon Horman (Netronome), Rob Truesdell (Netronome), John Fastabend (Intel), Or Gerlitz (Mellanox), Lori Jakab (Cisco), Jiri Pirko (Red Hat), Alexei Starovoitov (PLUMgrid), Zoltan Lajos Kis (Ericsson), Justin Pettit (VMware), Jesse Gross (VMware), Thomas Graf (Noiro/Cisco), Daniel Borkmann (Red Hat), Jiri Benc (Red Hat), Dan Dumitriu (Midokura), Guillermo Ontanon (Midokura), Thomas Bachman (Noiro/Cisco), Roni (Mellanox)

Agenda

What problems do you face today?

Datapath and vanilla kernel out of sync
- Where to put the primary datapath repo?
- Copy both ovs-dev and netdev on datapath patch proposals
  => Many agree with this suggestion - a no-brainer; at least do this
  - Patches need to be on the appropriate repo, otherwise people won't look at them
- Do development of the kernel part primarily in consultation with netdev?
  - Red Hat prefers this approach - RHEL can't take code from other repos
  - May be problematic because the OVS userspace is so large
  - Is the userspace part critical when proposing patches, or is the API to userspace sufficient to evaluate changes?
- OVS repo contains many backports - keep them there?
  - Many prefer the compat framework for maintaining the backports
- What are the gaps currently?
  - MPLS: ready to be merged, waiting for net-next to open
  - LISP: not ready; negative feedback received on some patches

Multiple userspace implementations, including proprietary ones
- The official userspace datapath on openvswitch.org is used on BSD, with DPDK, and for testing
- Intel OVDK is merging with ovs.org
- Proprietary userspace datapaths remain

Hardware offload
- No interest from VMware in maintaining an ABI - would be based on source
- For Hyper-V this would be aligned with Netlink
- Offload of flattened flows should be a configurable operation
- Partial / selective offload is a must, since limited hardware is a given
- Where to put the logic that decides what can be offloaded?
  - Userspace seems the logical choice, to avoid overloading the kernel with complexity
  - Even those who only use the OVS kernel code agree that too much kernel complexity is not advisable
  - Perhaps introduce a new entity interfacing to the acceleration hardware - distinct from the OVS kernel module and OVS userspace - to accommodate those with a different userspace? (Some are even considering hosting this on the controller)
- The ability to fall back to both the kernel and userspace must be provided
- Offload must remain possible in future setups where the hardware and software datapaths do not share a common model (e.g. the current cached flow table vs. P4)

Proposals for a kernel offload API:
- Jiri Pirko (Red Hat): SWDEV - abstract hardware switches as net devices with additional NDOs that allow offloading flows (see the sketch below). The offload decision is hooked into the OVS kernel datapath, i.e. OVS calls into the NDO hooks directly.
- John Fastabend (Intel): move the policy decision to userspace; provide a Netlink interface that exports hardware capabilities to userspace and lets userspace inject flows into the hardware through a common API.
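To make the SWDEV direction concrete, here is a minimal kernel-side sketch of what flow-offload NDOs on a switch net device might look like. The struct and hook names (swdev_flow_sketch, ndo_flow_insert, ndo_flow_remove) are hypothetical illustrations, not the names used in the actual patch series:

    /* Hypothetical flow-offload hooks on a switch net_device, in the spirit
     * of the SWDEV proposal; names and signatures are illustrative only. */
    #include <linux/netdevice.h>

    struct swdev_flow_sketch {
            /* match key, mask and actions in whatever format the datapath uses */
            const void *key;
            const void *mask;
            const void *actions;
            size_t actions_len;
    };

    struct swdev_ops_sketch {
            /* additional NDO-style hooks a switch driver would implement */
            int (*ndo_flow_insert)(struct net_device *dev,
                                   const struct swdev_flow_sketch *flow);
            int (*ndo_flow_remove)(struct net_device *dev,
                                   const struct swdev_flow_sketch *flow);
    };

    /* The OVS kernel datapath would call ndo_flow_insert() when installing a
     * flow and fall back to software processing if the driver returns an
     * error such as -EOPNOTSUPP. */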
How do you model the hardware?
- Capabilities exported as graphs
- Patches posted to John's github page: https://github.com/jrfastab/flow-net-next
- Netronome has implemented three acceleration options (and sees use cases requiring each of them):
  - usermode / accelerator with the entire datapath offloaded (ofproto-level hooks)
  - usermode / accelerator with traffic sent to usermode for fallback processing (ofproto-level hooks)
  - usermode / kernel / accelerator with traffic sent to the kernel, then to usermode, for fallback processing (kernel hooks for flow insertion/deletion and vport add/remove)
- Agreement on merging Jiri's and John's proposals into a single generic Netlink-based offload API
  - Netlink as-is is likely too slow to handle both jumbo frames to userspace and high-volume flow updates
  - Memory-mapped Netlink has just been removed
  - May need an async message option
  - Netronome implemented a control-message-based transport from userspace directly to the acceleration hardware - an async control-message model
  - Intel developed a thin Netlink layer mapping to messages close to those required by the hardware
- Difficult to encode the capabilities of diverse hardware platforms
  - May need to encode capabilities as pluggable code, not just data
- Options for modeling the split, and the proposed sequence in which to implement them:
  - Easiest: model each hardware / software entity as a separate virtual switch and connect them by internal ports over which packets (without OpenFlow metadata) flow
  - More difficult: split at the table level - as well as QoS and similarly sized major blocks; each table implementable by different hardware / software instances - need to convey OpenFlow metadata
  - Hardest: permit each action in an action list to be implemented by a different entity - difficult to e.g. hand off OpenFlow / OVS register contents
  - Resistance to extending the kernel packet structure with additional metadata
- How will userspace know which table fields / match options (exact vs. wildcard) / actions etc. will be employed, so it can use the most efficient model with sufficient semantics supported by the hardware?
  - Table features vs. TTPs vs. OVSDB / OpenStack extensions etc.
  - Unclear how it will know - assume for our purposes that it will - but we may need to backtrack if that assumption turns out wrong

Security updates for OVS
- New mailing list for 0-day incidents

Status updates (see slides in Dropbox)
https://www.dropbox.com/s/t1ikm6ij06z80ex/LoriJakab_OVSMicroConference.pdf?dl=0

LISP (Lori Jakab)
- LISP is implemented by border routers; this avoids each leaf node in the system needing to be in each router
- RFCs specify the use of LISP for overlays (see slides for details)
- One use case is maintaining connectivity for mobile hosts
- No OpenFlow support - potentially relevant tickets: EXT-112 (making good progress) and EXT-382 (not guaranteed to proceed - prototyping stalled and controversial; might be replaced with a different protocol-independent layer)
- Working on a separate LISP kernel module - analogous to the existing GRE/VXLAN modules to which OVS now interfaces
  - Was not accepted because it is another route cache
- A more generic encap mechanism - Generic Protocol Extension (GPE) - can be leveraged for LISP
  - Dislike the separate next-protocol field - would prefer just an Ethertype
  - GENEVE could also be used (only an L2 inner has been implemented, but L3 is possible according to the draft)
  - By setting various GPE bits, e.g. the flags, to zero, a valid LISP packet results
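For reference in the encapsulation discussion above, here is a minimal sketch of the 8-byte LISP data-plane header from RFC 6830, written as plain byte fields to sidestep bitfield endianness issues; the struct name is ours and the RFC remains the authoritative definition:

    #include <stdint.h>

    /* LISP data-plane header (RFC 6830): 8 bytes following the outer UDP
     * header. This is the header a GPE encapsulation would need to degenerate
     * to when its extra bits are cleared. */
    struct lisp_data_hdr_sketch {
            uint8_t flags;        /* N, L, E, V, I bits plus 3 reserved flag bits */
            uint8_t nonce_mv[3];  /* nonce or map-version, depending on the flags */
            uint8_t iid_lsb[4];   /* 24-bit instance ID + 8 locator-status-bits
                                     when I=1, or 32 locator-status-bits when I=0 */
    };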
Conntrack / NAT / Crypto
http://openvswitch.org/slides/OpenStack-140513.pdf

Conntrack
- Today there is no capability to track or share state between packets and use connection tracking
- The existing way to implement a reflexive ACL was to use the learn action to learn a flow in the opposite direction - minimal security guarantees and a performance issue
- More recently, support for tcp_flags matching was added, allowing matching on ACK bits - this still does not handle out-of-window TCP packets
- The idea is to use the existing netfilter connection tracking instead and allow storing/retrieving its state
- The feature includes a new conntrack() action to feed packets to conntrack and a conn_state field to match on connection state (see the illustrative sketch further below)
- A patch was supplied for feedback; zone support from Thomas Graf was then integrated
- Still need to enhance the code to send fragmented packets through the frag handling code
- Upper performance bound is around 6 Gbit/s when going through netfilter
- A new vendor extension action specifies the zone and whether or not to recirculate
- Interest exists in supplying a compatible userspace implementation (e.g. PF-based or different) - potentially even accelerated - but the interface needs to be clean enough for this (not just mirror Linux's interface)
- Metadata handling needs to be considered

NAT
- Thomas posted a patch to add a NAT action, assuming connection tracking state already exists
- Would support stateful NAT (translating L4 ports)
- Tricky to handle bidirectional traffic
  - Easiest is to position NAT on one "side" of the switch, e.g. at egress to the public port / at ingress from the public port, so that OpenFlow always sees private addresses
- Initially there is no need to expose table contents to OpenFlow, so it is OK to expose this as NAT + un-NAT actions deployable at separate places in the packet processing pipeline
  - Expose as synchronized tables once controller vendors want to see the contents of the NAT tables
- Again, Linux kernel datapath only

Crypto (IPsec)
- Similar to conntrack - currently a kernel feature, would be missing in userspace etc.
- Prefer to have a mechanism deployable in userspace, the kernel and accelerators
- Question is whether it is OK to keep this outside OpenFlow, or whether parts (or all) of it need to be exposed to OpenFlow
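As a purely illustrative aside, here is a sketch of what the conntrack() action's nested Netlink attributes and the conn_state match bits described above could look like; every name below is hypothetical (hence the _SKETCH suffix) and not taken from the patches under review:

    #include <stdint.h>

    /* Hypothetical nested attributes carried by a conntrack() datapath action. */
    enum ovs_ct_attr_sketch {
            OVS_CT_ATTR_ZONE_SKETCH,    /* u16: conntrack zone to look up in */
            OVS_CT_ATTR_RECIRC_SKETCH,  /* flag: recirculate after the lookup so
                                           that conn_state can be matched */
    };

    /* Hypothetical conn_state bits a flow could match on after recirculation. */
    #define OVS_CS_NEW_SKETCH         (1 << 0)  /* first packet of a connection */
    #define OVS_CS_ESTABLISHED_SKETCH (1 << 1)  /* part of an existing connection */
    #define OVS_CS_RELATED_SKETCH     (1 << 2)  /* related, e.g. an ICMP error */
    #define OVS_CS_INVALID_SKETCH     (1 << 3)  /* could not be associated, e.g.
                                                   out-of-window TCP */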
eBPF-based datapath (Alexei Starovoitov, PLUMgrid)
- Current focus of eBPF is on tracing
- Motivation: integrate eBPF with the OVS kernel datapath
- Provides additional programmability, similar to P4 vs. OF 1.x - some semantics, e.g. complex logic, are difficult to express using tables
- (Separate) use cases:
  - Non-OpenFlow, e.g. potentially traditional networking - flexibly deploy e.g. L2 with learning
  - High-level optimization, similar to nftables
- Possible approach to protocol-independent parsing exposed through OpenFlow
  - Can act as glue between tables; does not need to replace matching in particular tables
  - Could also replace / encode the parsing logic
  - Incremental parsing (for improved performance) vs. up-front complete parsing
  - To reconcile this with option 1 below, a BPF-based pre-flow-table option (for parsing only) and a post-flow-table option (further processing) could be used
  - If the table is empty, this can be optimized further by replacing the pre and post programs with a simpler unified one
  - Is this needed though, if we ignore PIF-type usage? It could for example be used to obtain TCP window sizes for analytics (see the sketch at the end of this section)
- Q: Does this need to integrate with OVS? Why not just hook into ingress at the netdev?
  A: More options are available w.r.t. where to divert traffic when integrated.
- Option 1a: keep the existing megaflow hash tables and call eBPF on flow miss
- Option 1b: eBPF as an action
  - Consensus that this is the easiest to implement
  - The BPF program is provided by userspace (not necessarily exposed to the controller - initially not)
  - Could provide an easy avenue for new actions without going through the heavy process of adding a new datapath action
  - Not as flexible as C, which is good, since it can potentially be compiled to certain hardware platforms
- Option 2: replace the full lookup & execution with BPF code
- Potential option 4: the table matches fields, and an additional table column contains an expression which also needs to match (here tables remain the main control logic; the expression is an add-on to each row)
- Expressiveness: limited execution time, run to completion (no loops); can call out to functions (which could be implemented in hardware or software)
- Conceptually a program is a set of connected netdevs, each with multiple ports, connected in some topology; some nodes can be collapsed into fewer for improved performance
- Potential concerns about compatibility with the existing ABI and the requirement to maintain two parallel datapath implementations going forward (flow lookup and BPF)
  - Would need to keep the old configuration ABI; possibly provide compat through a BPF program
  - Initially retain the existing C code for parsing, as it is faster anyway
  - Userspace can't be broken if the default behavior remains unchanged, since userspace would know whether a program, and which program, has been downloaded
- Need to constrain which kernel functions are permissible to call - e.g. output to port, add header, compute checksums
  - Especially important when also permitting userspace and accelerated target platforms
  - Take care not to disrupt existing GSO checksum handling, e.g. the related metadata / flags / offsets prepended to packets, or explicitly permit these to be set
- Code can be made available after a rebase onto the latest BPF changes
- Exposing the idea of BPF to the controller opens a new set of questions
  - Conceptually we need an overall control flow mechanism (around, say, OVS, IPsec, QoS etc.) and a detailed packet manipulation mechanism - need to decide which of these eBPF will perform (only the detailed part vs. both) and how to expose it
  - This would determine where to hook it in and which people need to be involved (OVS vs. the general Linux community vs. OpenFlow etc.)
- Steps forward:
  0.   Add BPF program invocation to sockets
  0.5  Extend cls_bpf with eBPF capability (Daniel will take care of this ;))
  1.   Add read-only BPF programs as actions to OVS - used for the convenience of userspace - not exposed to OpenFlow (not even as a custom action)
  2.   Enable programs to write to packets and forward packets to ports - again initially not exposed to OpenFlow
  3.   Add ABI to handle encapsulation with offload
  4.   Add the possibility to run a BPF program on flow miss
  M.   Implement in userspace only, without the kernel, e.g. on DPDK (for some value of M - depends on market demand)
       - Enables BPF programs to be exposed - e.g. downloaded by a controller - and run on the various available hardware / software platforms
  N.   Implementations for acceleration hardware
  N+1. eBPF-only OVS datapath
- Further discussion of the "outer control flow" will happen in the ONF Forwarding Abstractions WG, and of the Protocol Independent Forwarding part in the ONF PIF open source project
- Schedule follow-up discussions at the next meetups
- Start advertising the idea on a blog
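To make step 0.5 and the analytics use case mentioned earlier concrete, here is a minimal cls_bpf-style sketch in restricted C that records the last observed TCP window size per source address. It uses present-day libbpf conventions and helpers (bpf_skb_load_bytes, bpf_map_update_elem), which postdate this meeting; the map and section names are our own:

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <linux/if_ether.h>
    #include <linux/in.h>
    #include <linux/ip.h>
    #include <linux/tcp.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    /* Last observed TCP window size, keyed by source IPv4 address. */
    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            __uint(max_entries, 1024);
            __type(key, __u32);
            __type(value, __u16);
    } tcp_window SEC(".maps");

    SEC("classifier")
    int record_tcp_window(struct __sk_buff *skb)
    {
            struct ethhdr eth;
            struct iphdr iph;
            struct tcphdr tcph;
            __u32 saddr;
            __u16 window;

            /* Copy headers out of the packet; pass anything that is not
             * IPv4/TCP through untouched. */
            if (bpf_skb_load_bytes(skb, 0, &eth, sizeof(eth)) < 0)
                    return TC_ACT_OK;
            if (eth.h_proto != bpf_htons(ETH_P_IP))
                    return TC_ACT_OK;
            if (bpf_skb_load_bytes(skb, sizeof(eth), &iph, sizeof(iph)) < 0)
                    return TC_ACT_OK;
            if (iph.protocol != IPPROTO_TCP)
                    return TC_ACT_OK;
            if (bpf_skb_load_bytes(skb, sizeof(eth) + iph.ihl * 4,
                                   &tcph, sizeof(tcph)) < 0)
                    return TC_ACT_OK;

            saddr = iph.saddr;
            window = bpf_ntohs(tcph.window);
            bpf_map_update_elem(&tcp_window, &saddr, &window, BPF_ANY);

            return TC_ACT_OK;  /* observe only; never modify or drop */
    }

    char _license[] SEC("license") = "GPL";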
OpenFlow API for encap metadata
- Geneve and other encap protocols introduce metadata (options conveyed in packets); the question is how to expose this metadata via OpenFlow (a sketch of a Geneve option TLV appears at the end of these notes)
- Considering passing it through to userspace and beyond as opaque values somehow
- The GENEVE type space is large - it would consume the entire OXM space => need to extend the OXM class
  - Recently the experimenter OXM ID space was reduced - see https://rs.opennetworking.org/bugs/browse/EXT-380 and the ensuing discussion
  - Nevertheless, experimenter OXM encoding could be used for this, with a dedicated experimenter ID
- Desire to handle proprietary encap protocols with metadata in a way that allows mapping to Geneve TLVs in the future
  - An eBPF converter to map generic tunnel metadata to specific protocol headers would provide sufficient flexibility
- The issues are representing this within a switch, accessible via matching / actions, and across the network, as something lighter weight than packet-in/out but more expressive than the tunnel format

Packet processors (Zoltan) - https://rs.opennetworking.org/bugs/browse/EXT-122
- Examples of issues with the existing logical port scheme: these cannot be chained, cannot perform variable actions depending on whether the MTU is exceeded, etc.; therefore a more flexible mechanism is needed
- Can perform an opaque operation in the ASIC, or pipe to a control processor to perform it and come back
- Invoke these via experimenter IDs
- See also the tasks proposal - slides attached to https://rs.opennetworking.org/bugs/browse/EXT-494 - which refactors action set/list, actions vs. instructions, flow vs. group vs. egress tables etc.
- See also protocol independent forwarding - would have built-in actions / functions as well as externally named opaque functions that can be invoked

Other potential features to work on
- No major wishlist items at the OpenFlow control protocol level, since the more recent OpenFlow versions have been implemented
- QoS / metering issues: accuracy of implementations (easier to achieve with hardware than software); representation in OpenFlow / OF-Config is poorly defined
  - John to provide an RFC patchset to allow hardware offload of TBF per queue, and eventually HTB for a flat hierarchy
  - Consider deriving an abstraction from the various implementations, then define a generic way to expose it via OpenFlow / OF-Config / OVSDB etc.

Consensus on organizing meetups like this again in the future
- Perhaps paste wishlist items into a document
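As referenced in the encap-metadata section above, here is a minimal sketch of a single Geneve option TLV as laid out in the Geneve draft, to show the kind of per-packet metadata an OpenFlow representation would need to carry; the struct name is ours and the bitfield layout is shown for GCC/Clang-style compilers only:

    #include <stdint.h>

    /* One Geneve option TLV: a 16-bit option class (namespace), an 8-bit type
     * whose high bit marks the option as critical, 5 bits of length counted in
     * 4-byte words, and a variable-length value. Matching on these
     * <class, type> -> value triples is what the OXM discussion above is about. */
    struct geneve_opt_sketch {
            uint16_t opt_class;   /* option namespace, network byte order */
            uint8_t  type;        /* option type; 0x80 bit = critical */
    #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
            uint8_t  length : 5;  /* length of opt_data in 4-byte multiples */
            uint8_t  rsvd   : 3;
    #else
            uint8_t  rsvd   : 3;
            uint8_t  length : 5;
    #endif
            uint8_t  opt_data[];  /* option value, length * 4 bytes */
    };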