On Nov 27, 2012, at 5:13 PM, Jesse Gross <je...@nicira.com> wrote: > On Tue, Nov 27, 2012 at 12:19 PM, Kyle Mestery <kmest...@cisco.com> wrote: >> Note: v4 of this patch removes VXLAN over IPSEC support, >> per an offline conversation with Jesse. >> >> Add support for VXLAN tunnels to Open vSwitch. Add support >> for setting the destination UDP port on a per-port basis. >> This is done by adding a "dst_port" parameter to the port >> configuration. This is only applicable currently to VXLAN >> tunnels. >> >> Please note this currently does not implement any sort of multicast >> learning. With this patch, VXLAN tunnels must be configured similar >> to GRE tunnels (e.g. point to point). A subsequent patch will implement >> a VXLAN control plane in userspace to handle multicast learning. >> >> This patch set is based on one posted by Ben Pfaff on Oct. 12, 2011 >> to the ovs-dev mailing list: >> >> http://openvswitch.org/pipermail/dev/2011-October/012051.html >> >> The patch has been maintained, updated, and freshened by me and a >> version of it is available at the following github repository: >> >> https://github.com/mestery/ovs-vxlan/tree/vxlan >> >> I've tested this patch with multiple VXLAN tunnels between hosts >> using different UDP port numbers. Performance is on par (though >> slightly faster) than comparable GRE tunnels. >> >> See the following IETF draft for additional information about VXLAN: >> http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02 >> >> Signed-off-by: Kyle Mestery <kmest...@cisco.com> > > Thanks Kyle. > Thanks for all the comments Jesse, I'm addressing them and will send out a new patch once it's tested again.
> I got a warning from sparse: > CHECK /home/jgross/openvswitch/datapath/linux/vport-vxlan.c > /home/jgross/openvswitch/datapath/linux/vport-vxlan.c:372:6: warning: > symbol 'vxlan_tnl_destroy' was not declared. Should it be static > >> diff --git a/NEWS b/NEWS >> index bb80beb..c343f3a 100644 >> --- a/NEWS >> +++ b/NEWS >> @@ -4,6 +4,8 @@ post-v1.9.0 >> >> v1.9.0 - xx xxx xxxx >> -------------------- >> + - New support for the VXLAN tunnel protocol (see the IETF draft here: >> + http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02). > > I think we probably won't want to put any new features into 1.9 (it's > already been branched for stabilization for a while) so let's put it > into the post-1.9 category. > >> diff --git a/datapath/vport-vxlan.c b/datapath/vport-vxlan.c >> new file mode 100644 >> index 0000000..88e03d5 >> --- /dev/null >> +++ b/datapath/vport-vxlan.c >> @@ -0,0 +1,459 @@ >> + /* >> + * Copyright (c) 2011 Nicira, Inc. >> + * Copyright (c) 2012 Cisco Systems, Inc. >> + * Distributed under the terms of the GNU GPL version 2. >> + * >> + * Significant portions of this file may be copied from parts of the Linux >> + * kernel, by Linus Torvalds and others. >> + */ > > We've since switched to a more standard GPL header for the kernel > files (for example, in datapath.c). Can you make it match? > >> +/* Default to the OTV port, per the VXLAN IETF draft. */ >> +#define VXLAN_DST_PORT 8472 > > I'd prefer if we pushed the default to userspace and not have the > kernel hardcode a particular port. In other words, userspace would > always request a particular port (using this default if none was > specified in the database) and have the kernel reject it if none was > specified. > >> +static struct hlist_head *vxlan_hash_bucket(struct net *net, u16 port) >> +{ >> + unsigned int hash = jhash(&port, sizeof(port), (unsigned long) net); > > Since it's just one item, you can use jhash_1word() here. > >> +/* The below used as the min/max for the UDP port range */ >> +#define VXLAN_SRC_PORT_MIN 32768 >> +#define VXLAN_SRC_PORT_MAX 61000 > > Rather than statically defining these we can use > inet_get_local_port_range() to get the ephemeral range. > >> +/* Compute source port for outgoing packet. >> + * Currently we use the flow hash. >> + */ >> +static u16 get_src_port(struct sk_buff *skb) >> +{ >> + unsigned int range = (VXLAN_SRC_PORT_MAX - VXLAN_SRC_PORT_MIN) + 1; >> + u32 hash = OVS_CB(skb)->flow->hash; >> + >> + return (__force u16)(((u64) hash * range) >> 32) + >> VXLAN_SRC_PORT_MIN; > > Do we actually need the __force here? It's a sparse construct and all > of these look like normal C operations. > >> +static struct sk_buff *vxlan_build_header(const struct vport *vport, >> + const struct tnl_mutable_config >> *mutable, >> + struct dst_entry *dst, >> + struct sk_buff *skb, >> + int tunnel_hlen) >> +{ >> + struct udphdr *udph = udp_hdr(skb); >> + struct vxlanhdr *vxh = (struct vxlanhdr *)(udph + 1); >> + const struct ovs_key_ipv4_tunnel *tun_key = OVS_CB(skb)->tun_key; >> + __be64 out_key; >> + >> + if (tun_key->ipv4_dst) >> + out_key = tun_key->tun_id; >> + else >> + out_key = mutable->out_key; > > I don't think this works in the case of port-based tunneling with flow > keys. I would just use tnl_get_param() to make everything consistent. > >> +/* Called with rcu_read_lock and BH disabled. */ >> +static int vxlan_rcv(struct sock *sk, struct sk_buff *skb) >> +{ >> + struct vport *vport; >> + struct vxlanhdr *vxh; >> + const struct tnl_mutable_config *mutable; >> + struct iphdr *iph; >> + struct ovs_key_ipv4_tunnel tun_key; >> + int tunnel_type; >> + __be64 key; >> + u32 tunnel_flags = 0; >> + >> + if (unlikely(!pskb_may_pull(skb, VXLAN_HLEN + ETH_HLEN))) >> + goto error; >> + >> + vxh = vxlan_hdr(skb); >> + if (unlikely(vxh->vx_flags != htonl(VXLAN_FLAGS) || >> + vxh->vx_vni & htonl(0xff))) >> + goto error; >> + >> + __skb_pull(skb, VXLAN_HLEN); >> + skb_postpull_rcsum(skb, skb_transport_header(skb), VXLAN_HLEN + >> ETH_HLEN); >> + >> + key = cpu_to_be64(ntohl(vxh->vx_vni) >> 8); >> + >> + tunnel_type = TNL_T_PROTO_VXLAN; > > I'm not sure that this actually needs to be stored in a variable since > it is only used once. > >> +static int vxlan_socket_init(struct vxlan_port *vxlan_port) >> +{ >> + int err; >> + struct sockaddr_in sin; >> + >> + err = sock_create(AF_INET, SOCK_DGRAM, 0, >> &vxlan_port->vxlan_rcv_socket); >> + if (err) >> + goto error; > > It doesn't look like this handles namespaces. I think we need to use > sock_create_kern(0 and sk_change_net() like in CAPWAP. > >> +static int vxlan_tunnel_setup(struct net *net, const char *linkname, >> + struct nlattr *options) >> +{ >> + struct nlattr *a[OVS_TUNNEL_ATTR_MAX + 1]; >> + int err; >> + u16 dst_port; >> + struct vxlan_port *vxlan_port; >> + struct vxlan_if *vxlan_if; >> + >> + if (!options) { >> + err = -EINVAL; >> + goto out; >> + } >> + >> + err = nla_parse_nested(a, OVS_TUNNEL_ATTR_MAX, options, >> vxlan_policy); >> + if (err) >> + goto out; > > You could use nla_find_nested(), it seems to match what you are > looking for here. > >> + >> + if (a[OVS_TUNNEL_ATTR_DST_PORT]) >> + dst_port = nla_get_u16(a[OVS_TUNNEL_ATTR_DST_PORT]); >> + else >> + dst_port = VXLAN_DST_PORT; >> + >> + /* Verify if we already have a socket created for this port */ >> + vxlan_port = vxlan_port_exists(net, dst_port); >> + if (vxlan_port) { >> + vxlan_port->count++; >> + err = 0; >> + goto out; >> + } >> + >> + /* Add a new socket for this port */ >> + vxlan_port = kmalloc(sizeof(struct vxlan_port), GFP_KERNEL); >> + if (!vxlan_port) { >> + err = -ENOMEM; >> + goto out; >> + } >> + memset (vxlan_port, 0, sizeof(struct vxlan_port)); > > You could combine the kmalloc() and memset() using kzalloc() instead. > >> + vxlan_port->port = dst_port; >> + vxlan_port->count++; > > I think this would be clearer if you initialized it to 1 (since I > think this will always be the first user). > >> +static int vxlan_set_options(struct vport *vport, struct nlattr *options) >> +{ >> + int err; >> + const char *vname = vport->ops->get_name(vport); >> + >> + err = vxlan_tunnel_setup(ovs_dp_get_net(vport->dp), vname, options); >> + if (err) >> + goto out; > > If the dest port is changed and the refcount drops to zero, shouldn't > we remove that socket? I think the best approach may be to add the > new socket, call set_options, and then drop the old socket. > >> +void vxlan_tnl_destroy(struct vport *vport) >> +{ >> + struct vxlan_if *vxlan_if; >> + struct vxlan_port *vxlan_port; >> + const char *vname = vport->ops->get_name(vport); >> + >> + vxlan_if = vxlan_if_by_name(ovs_dp_get_net(vport->dp), vname); >> + if (!vxlan_if) >> + goto out; > > I think we can use tnl_vport_priv() convert the vport to a tnl_vport > and pull the dst_port out of the mutable directly. Then we wouldn't > have to bother with the if_name hash at all. > >> +static struct vport *vxlan_tnl_create(const struct vport_parms *parms) >> +{ >> + int err; >> + >> + err = vxlan_tunnel_setup(ovs_dp_get_net(parms->dp), parms->name, >> + parms->options); >> + return ovs_tnl_create(parms, &ovs_vxlan_vport_ops, >> &ovs_vxlan_tnl_ops); > > If ovs_tnl_create() fails won't this result in an incorrect refcount? > >> +static int vxlan_init(void) >> +{ > [...] >> + vxlan_ports = kzalloc(VXLAN_SOCK_HASH_BUCKETS * sizeof(struct >> hlist_head), >> + GFP_KERNEL); > > We probably could get away with just making this fairly small and then > statically allocating it. Realistically, even just a linked list > would be fine. > >> diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h >> index e7d4b49..2ae5681 100644 >> --- a/include/linux/openvswitch.h >> +++ b/include/linux/openvswitch.h >> @@ -186,6 +186,7 @@ enum ovs_vport_type { >> OVS_VPORT_TYPE_PATCH = 100, /* virtual tunnel connecting two vports */ >> OVS_VPORT_TYPE_GRE, /* GRE tunnel */ >> OVS_VPORT_TYPE_CAPWAP, /* CAPWAP tunnel */ >> + OVS_VPORT_TYPE_VXLAN, /* VXLAN tunnel */ > > Since the plan is to get this upstream, let's put it after FT_GRE. > That way we can avoid having to do any switchover. > >> diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c >> index 5171171..3dbb798 100644 >> --- a/lib/netdev-vport.c >> +++ b/lib/netdev-vport.c >> @@ -173,6 +173,13 @@ netdev_vport_get_netdev_type(const struct >> dpif_linux_vport *vport) >> case OVS_VPORT_TYPE_CAPWAP: >> return "capwap"; >> >> + case OVS_VPORT_TYPE_VXLAN: >> + if (tnl_port_config_from_nlattr(vport->options, vport->options_len, >> + a)) { >> + break; >> + } > > We only need to retrieve the attributes in order to check if IPsec is > set, so we should be able to drop the if block here. > >> @@ -584,6 +591,7 @@ parse_tunnel_config(const char *name, const char *type, >> const struct smap *args, struct ofpbuf *options) >> { >> bool is_gre = false; >> + bool is_vxlan = false; > > I think it's probably better to move to more generic names like > "has_dst_port" rather than directly tying it to the protocol (the > effect would still be the same though). > >> @@ -835,6 +848,11 @@ unparse_tunnel_config(const char *name OVS_UNUSED, >> const char *type OVS_UNUSED, >> smap_add_format(args, "tos", "0x%x", tos); >> } >> >> + if (a[OVS_TUNNEL_ATTR_DST_PORT]) { >> + ovs_be16 dst_port = nl_attr_get_be16(a[OVS_TUNNEL_ATTR_DST_PORT]); > > I believe that everywhere else the type for dst_port is a u16 instead of be16. > >> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml >> index c2786a5..65ead8c 100644 >> --- a/vswitchd/vswitch.xml >> +++ b/vswitchd/vswitch.xml >> @@ -1393,9 +1412,10 @@ >> deprecated and will be removed soon. >> </column> >> >> - <group title="Tunnel Options: gre only"> >> + <group title="Tunnel Options: gre and vxlan only"> >> <p> >> - Only <code>gre</code> interfaces support these options. >> + Only <code>gre</code> and <code>vxlan</code> interfaces support >> these >> + options. >> </p> >> </group> > > I think we can just delete this empty group entirely now. > >> @@ -1427,11 +1447,19 @@ >> </column> >> </group> >> >> - <group title="Tunnel Options: ipsec_gre only"> >> + <group title="Tunnel Options: ipsec_gre and ipsec_vxlan only"> >> <p> >> - Only <code>ipsec_gre</code> interfaces support these options. >> + Only <code>ipsec_gre</code> and <code>ipsec_vxlan</code> >> interfaces >> + support these options. >> </p> >> >> + <p> >> + These options are implemented through a separate daemon named >> + <code>ovs-monitor-ipsec</code> that so far has only been ported to >> + and packaged for Debian (including derivative distributions such as >> + Ubuntu). >> + </p> > > Since IPsec has been removed, I believe these changes are no longer relevant. _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev