On Tue, Nov 27, 2012 at 12:19 PM, Kyle Mestery <kmest...@cisco.com> wrote: > Note: v4 of this patch removes VXLAN over IPSEC support, > per an offline conversation with Jesse. > > Add support for VXLAN tunnels to Open vSwitch. Add support > for setting the destination UDP port on a per-port basis. > This is done by adding a "dst_port" parameter to the port > configuration. This is only applicable currently to VXLAN > tunnels. > > Please note this currently does not implement any sort of multicast > learning. With this patch, VXLAN tunnels must be configured similar > to GRE tunnels (e.g. point to point). A subsequent patch will implement > a VXLAN control plane in userspace to handle multicast learning. > > This patch set is based on one posted by Ben Pfaff on Oct. 12, 2011 > to the ovs-dev mailing list: > > http://openvswitch.org/pipermail/dev/2011-October/012051.html > > The patch has been maintained, updated, and freshened by me and a > version of it is available at the following github repository: > > https://github.com/mestery/ovs-vxlan/tree/vxlan > > I've tested this patch with multiple VXLAN tunnels between hosts > using different UDP port numbers. Performance is on par (though > slightly faster) than comparable GRE tunnels. > > See the following IETF draft for additional information about VXLAN: > http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02 > > Signed-off-by: Kyle Mestery <kmest...@cisco.com>
Thanks Kyle. I got a warning from sparse: CHECK /home/jgross/openvswitch/datapath/linux/vport-vxlan.c /home/jgross/openvswitch/datapath/linux/vport-vxlan.c:372:6: warning: symbol 'vxlan_tnl_destroy' was not declared. Should it be static > diff --git a/NEWS b/NEWS > index bb80beb..c343f3a 100644 > --- a/NEWS > +++ b/NEWS > @@ -4,6 +4,8 @@ post-v1.9.0 > > v1.9.0 - xx xxx xxxx > -------------------- > + - New support for the VXLAN tunnel protocol (see the IETF draft here: > + http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02). I think we probably won't want to put any new features into 1.9 (it's already been branched for stabilization for a while) so let's put it into the post-1.9 category. > diff --git a/datapath/vport-vxlan.c b/datapath/vport-vxlan.c > new file mode 100644 > index 0000000..88e03d5 > --- /dev/null > +++ b/datapath/vport-vxlan.c > @@ -0,0 +1,459 @@ > + /* > + * Copyright (c) 2011 Nicira, Inc. > + * Copyright (c) 2012 Cisco Systems, Inc. > + * Distributed under the terms of the GNU GPL version 2. > + * > + * Significant portions of this file may be copied from parts of the Linux > + * kernel, by Linus Torvalds and others. > + */ We've since switched to a more standard GPL header for the kernel files (for example, in datapath.c). Can you make it match? > +/* Default to the OTV port, per the VXLAN IETF draft. */ > +#define VXLAN_DST_PORT 8472 I'd prefer if we pushed the default to userspace and not have the kernel hardcode a particular port. In other words, userspace would always request a particular port (using this default if none was specified in the database) and have the kernel reject it if none was specified. > +static struct hlist_head *vxlan_hash_bucket(struct net *net, u16 port) > +{ > + unsigned int hash = jhash(&port, sizeof(port), (unsigned long) net); Since it's just one item, you can use jhash_1word() here. > +/* The below used as the min/max for the UDP port range */ > +#define VXLAN_SRC_PORT_MIN 32768 > +#define VXLAN_SRC_PORT_MAX 61000 Rather than statically defining these we can use inet_get_local_port_range() to get the ephemeral range. > +/* Compute source port for outgoing packet. > + * Currently we use the flow hash. > + */ > +static u16 get_src_port(struct sk_buff *skb) > +{ > + unsigned int range = (VXLAN_SRC_PORT_MAX - VXLAN_SRC_PORT_MIN) + 1; > + u32 hash = OVS_CB(skb)->flow->hash; > + > + return (__force u16)(((u64) hash * range) >> 32) + VXLAN_SRC_PORT_MIN; Do we actually need the __force here? It's a sparse construct and all of these look like normal C operations. > +static struct sk_buff *vxlan_build_header(const struct vport *vport, > + const struct tnl_mutable_config > *mutable, > + struct dst_entry *dst, > + struct sk_buff *skb, > + int tunnel_hlen) > +{ > + struct udphdr *udph = udp_hdr(skb); > + struct vxlanhdr *vxh = (struct vxlanhdr *)(udph + 1); > + const struct ovs_key_ipv4_tunnel *tun_key = OVS_CB(skb)->tun_key; > + __be64 out_key; > + > + if (tun_key->ipv4_dst) > + out_key = tun_key->tun_id; > + else > + out_key = mutable->out_key; I don't think this works in the case of port-based tunneling with flow keys. I would just use tnl_get_param() to make everything consistent. > +/* Called with rcu_read_lock and BH disabled. */ > +static int vxlan_rcv(struct sock *sk, struct sk_buff *skb) > +{ > + struct vport *vport; > + struct vxlanhdr *vxh; > + const struct tnl_mutable_config *mutable; > + struct iphdr *iph; > + struct ovs_key_ipv4_tunnel tun_key; > + int tunnel_type; > + __be64 key; > + u32 tunnel_flags = 0; > + > + if (unlikely(!pskb_may_pull(skb, VXLAN_HLEN + ETH_HLEN))) > + goto error; > + > + vxh = vxlan_hdr(skb); > + if (unlikely(vxh->vx_flags != htonl(VXLAN_FLAGS) || > + vxh->vx_vni & htonl(0xff))) > + goto error; > + > + __skb_pull(skb, VXLAN_HLEN); > + skb_postpull_rcsum(skb, skb_transport_header(skb), VXLAN_HLEN + > ETH_HLEN); > + > + key = cpu_to_be64(ntohl(vxh->vx_vni) >> 8); > + > + tunnel_type = TNL_T_PROTO_VXLAN; I'm not sure that this actually needs to be stored in a variable since it is only used once. > +static int vxlan_socket_init(struct vxlan_port *vxlan_port) > +{ > + int err; > + struct sockaddr_in sin; > + > + err = sock_create(AF_INET, SOCK_DGRAM, 0, > &vxlan_port->vxlan_rcv_socket); > + if (err) > + goto error; It doesn't look like this handles namespaces. I think we need to use sock_create_kern(0 and sk_change_net() like in CAPWAP. > +static int vxlan_tunnel_setup(struct net *net, const char *linkname, > + struct nlattr *options) > +{ > + struct nlattr *a[OVS_TUNNEL_ATTR_MAX + 1]; > + int err; > + u16 dst_port; > + struct vxlan_port *vxlan_port; > + struct vxlan_if *vxlan_if; > + > + if (!options) { > + err = -EINVAL; > + goto out; > + } > + > + err = nla_parse_nested(a, OVS_TUNNEL_ATTR_MAX, options, vxlan_policy); > + if (err) > + goto out; You could use nla_find_nested(), it seems to match what you are looking for here. > + > + if (a[OVS_TUNNEL_ATTR_DST_PORT]) > + dst_port = nla_get_u16(a[OVS_TUNNEL_ATTR_DST_PORT]); > + else > + dst_port = VXLAN_DST_PORT; > + > + /* Verify if we already have a socket created for this port */ > + vxlan_port = vxlan_port_exists(net, dst_port); > + if (vxlan_port) { > + vxlan_port->count++; > + err = 0; > + goto out; > + } > + > + /* Add a new socket for this port */ > + vxlan_port = kmalloc(sizeof(struct vxlan_port), GFP_KERNEL); > + if (!vxlan_port) { > + err = -ENOMEM; > + goto out; > + } > + memset (vxlan_port, 0, sizeof(struct vxlan_port)); You could combine the kmalloc() and memset() using kzalloc() instead. > + vxlan_port->port = dst_port; > + vxlan_port->count++; I think this would be clearer if you initialized it to 1 (since I think this will always be the first user). > +static int vxlan_set_options(struct vport *vport, struct nlattr *options) > +{ > + int err; > + const char *vname = vport->ops->get_name(vport); > + > + err = vxlan_tunnel_setup(ovs_dp_get_net(vport->dp), vname, options); > + if (err) > + goto out; If the dest port is changed and the refcount drops to zero, shouldn't we remove that socket? I think the best approach may be to add the new socket, call set_options, and then drop the old socket. > +void vxlan_tnl_destroy(struct vport *vport) > +{ > + struct vxlan_if *vxlan_if; > + struct vxlan_port *vxlan_port; > + const char *vname = vport->ops->get_name(vport); > + > + vxlan_if = vxlan_if_by_name(ovs_dp_get_net(vport->dp), vname); > + if (!vxlan_if) > + goto out; I think we can use tnl_vport_priv() convert the vport to a tnl_vport and pull the dst_port out of the mutable directly. Then we wouldn't have to bother with the if_name hash at all. > +static struct vport *vxlan_tnl_create(const struct vport_parms *parms) > +{ > + int err; > + > + err = vxlan_tunnel_setup(ovs_dp_get_net(parms->dp), parms->name, > + parms->options); > + return ovs_tnl_create(parms, &ovs_vxlan_vport_ops, > &ovs_vxlan_tnl_ops); If ovs_tnl_create() fails won't this result in an incorrect refcount? > +static int vxlan_init(void) > +{ [...] > + vxlan_ports = kzalloc(VXLAN_SOCK_HASH_BUCKETS * sizeof(struct > hlist_head), > + GFP_KERNEL); We probably could get away with just making this fairly small and then statically allocating it. Realistically, even just a linked list would be fine. > diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h > index e7d4b49..2ae5681 100644 > --- a/include/linux/openvswitch.h > +++ b/include/linux/openvswitch.h > @@ -186,6 +186,7 @@ enum ovs_vport_type { > OVS_VPORT_TYPE_PATCH = 100, /* virtual tunnel connecting two vports */ > OVS_VPORT_TYPE_GRE, /* GRE tunnel */ > OVS_VPORT_TYPE_CAPWAP, /* CAPWAP tunnel */ > + OVS_VPORT_TYPE_VXLAN, /* VXLAN tunnel */ Since the plan is to get this upstream, let's put it after FT_GRE. That way we can avoid having to do any switchover. > diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c > index 5171171..3dbb798 100644 > --- a/lib/netdev-vport.c > +++ b/lib/netdev-vport.c > @@ -173,6 +173,13 @@ netdev_vport_get_netdev_type(const struct > dpif_linux_vport *vport) > case OVS_VPORT_TYPE_CAPWAP: > return "capwap"; > > + case OVS_VPORT_TYPE_VXLAN: > + if (tnl_port_config_from_nlattr(vport->options, vport->options_len, > + a)) { > + break; > + } We only need to retrieve the attributes in order to check if IPsec is set, so we should be able to drop the if block here. > @@ -584,6 +591,7 @@ parse_tunnel_config(const char *name, const char *type, > const struct smap *args, struct ofpbuf *options) > { > bool is_gre = false; > + bool is_vxlan = false; I think it's probably better to move to more generic names like "has_dst_port" rather than directly tying it to the protocol (the effect would still be the same though). > @@ -835,6 +848,11 @@ unparse_tunnel_config(const char *name OVS_UNUSED, const > char *type OVS_UNUSED, > smap_add_format(args, "tos", "0x%x", tos); > } > > + if (a[OVS_TUNNEL_ATTR_DST_PORT]) { > + ovs_be16 dst_port = nl_attr_get_be16(a[OVS_TUNNEL_ATTR_DST_PORT]); I believe that everywhere else the type for dst_port is a u16 instead of be16. > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml > index c2786a5..65ead8c 100644 > --- a/vswitchd/vswitch.xml > +++ b/vswitchd/vswitch.xml > @@ -1393,9 +1412,10 @@ > deprecated and will be removed soon. > </column> > > - <group title="Tunnel Options: gre only"> > + <group title="Tunnel Options: gre and vxlan only"> > <p> > - Only <code>gre</code> interfaces support these options. > + Only <code>gre</code> and <code>vxlan</code> interfaces support > these > + options. > </p> > </group> I think we can just delete this empty group entirely now. > @@ -1427,11 +1447,19 @@ > </column> > </group> > > - <group title="Tunnel Options: ipsec_gre only"> > + <group title="Tunnel Options: ipsec_gre and ipsec_vxlan only"> > <p> > - Only <code>ipsec_gre</code> interfaces support these options. > + Only <code>ipsec_gre</code> and <code>ipsec_vxlan</code> interfaces > + support these options. > </p> > > + <p> > + These options are implemented through a separate daemon named > + <code>ovs-monitor-ipsec</code> that so far has only been ported to > + and packaged for Debian (including derivative distributions such as > + Ubuntu). > + </p> Since IPsec has been removed, I believe these changes are no longer relevant. _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev