On Nov 27, 2012, at 5:13 PM, Jesse Gross <je...@nicira.com> wrote:

> On Tue, Nov 27, 2012 at 12:19 PM, Kyle Mestery <kmest...@cisco.com> wrote:
>> Note: v4 of this patch removes VXLAN over IPSEC support,
>> per an offline conversation with Jesse.
>> 
>> Add support for VXLAN tunnels to Open vSwitch. Add support
>> for setting the destination UDP port on a per-port basis.
>> This is done by adding a "dst_port" parameter to the port
>> configuration. This is only applicable currently to VXLAN
>> tunnels.
>> 
>> Please note this currently does not implement any sort of multicast
>> learning. With this patch, VXLAN tunnels must be configured similar
>> to GRE tunnels (e.g. point to point). A subsequent patch will implement
>> a VXLAN control plane in userspace to handle multicast learning.
>> 
>> This patch set is based on one posted by Ben Pfaff on Oct. 12, 2011
>> to the ovs-dev mailing list:
>> 
>> http://openvswitch.org/pipermail/dev/2011-October/012051.html
>> 
>> The patch has been maintained, updated, and freshened by me and a
>> version of it is available at the following github repository:
>> 
>> https://github.com/mestery/ovs-vxlan/tree/vxlan
>> 
>> I've tested this patch with multiple VXLAN tunnels between hosts
>> using different UDP port numbers. Performance is on par (though
>> slightly faster) than comparable GRE tunnels.
>> 
>> See the following IETF draft for additional information about VXLAN:
>> http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02
>> 
>> Signed-off-by: Kyle Mestery <kmest...@cisco.com>
> 
> Thanks Kyle.
> 
Thanks for all the comments Jesse, I'm addressing them and will send out a
new patch once it's tested again.

> I got a warning from sparse:
>  CHECK   /home/jgross/openvswitch/datapath/linux/vport-vxlan.c
> /home/jgross/openvswitch/datapath/linux/vport-vxlan.c:372:6: warning:
> symbol 'vxlan_tnl_destroy' was not declared. Should it be static
> 
>> diff --git a/NEWS b/NEWS
>> index bb80beb..c343f3a 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -4,6 +4,8 @@ post-v1.9.0
>> 
>> v1.9.0 - xx xxx xxxx
>> --------------------
>> +    - New support for the VXLAN tunnel protocol (see the IETF draft here:
>> +      http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02).
> 
> I think we probably won't want to put any new features into 1.9 (it's
> already been branched for stabilization for a while) so let's put it
> into the post-1.9 category.
> 
>> diff --git a/datapath/vport-vxlan.c b/datapath/vport-vxlan.c
>> new file mode 100644
>> index 0000000..88e03d5
>> --- /dev/null
>> +++ b/datapath/vport-vxlan.c
>> @@ -0,0 +1,459 @@
>> + /*
>> + * Copyright (c) 2011 Nicira, Inc.
>> + * Copyright (c) 2012 Cisco Systems, Inc.
>> + * Distributed under the terms of the GNU GPL version 2.
>> + *
>> + * Significant portions of this file may be copied from parts of the Linux
>> + * kernel, by Linus Torvalds and others.
>> + */
> 
> We've since switched to a more standard GPL header for the kernel
> files (for example, in datapath.c).  Can you make it match?
> 
>> +/* Default to the OTV port, per the VXLAN IETF draft. */
>> +#define VXLAN_DST_PORT 8472
> 
> I'd prefer if we pushed the default to userspace and not have the
> kernel hardcode a particular port.  In other words, userspace would
> always request a particular port (using this default if none was
> specified in the database) and have the kernel reject it if none was
> specified.
> 
>> +static struct hlist_head *vxlan_hash_bucket(struct net *net, u16 port)
>> +{
>> +       unsigned int hash = jhash(&port, sizeof(port), (unsigned long) net);
> 
> Since it's just one item, you can use jhash_1word() here.
> 
>> +/* The below used as the min/max for the UDP port range */
>> +#define VXLAN_SRC_PORT_MIN      32768
>> +#define VXLAN_SRC_PORT_MAX      61000
> 
> Rather than statically defining these we can use
> inet_get_local_port_range() to get the ephemeral range.
> 
>> +/* Compute source port for outgoing packet.
>> + * Currently we use the flow hash.
>> + */
>> +static u16 get_src_port(struct sk_buff *skb)
>> +{
>> +       unsigned int range = (VXLAN_SRC_PORT_MAX - VXLAN_SRC_PORT_MIN) + 1;
>> +       u32 hash = OVS_CB(skb)->flow->hash;
>> +
>> +       return (__force u16)(((u64) hash * range) >> 32) + 
>> VXLAN_SRC_PORT_MIN;
> 
> Do we actually need the __force here?  It's a sparse construct and all
> of these look like normal C operations.
> 
>> +static struct sk_buff *vxlan_build_header(const struct vport *vport,
>> +                                         const struct tnl_mutable_config 
>> *mutable,
>> +                                         struct dst_entry *dst,
>> +                                         struct sk_buff *skb,
>> +                                         int tunnel_hlen)
>> +{
>> +       struct udphdr *udph = udp_hdr(skb);
>> +       struct vxlanhdr *vxh = (struct vxlanhdr *)(udph + 1);
>> +       const struct ovs_key_ipv4_tunnel *tun_key = OVS_CB(skb)->tun_key;
>> +       __be64 out_key;
>> +
>> +       if (tun_key->ipv4_dst)
>> +               out_key = tun_key->tun_id;
>> +       else
>> +               out_key = mutable->out_key;
> 
> I don't think this works in the case of port-based tunneling with flow
> keys.  I would just use tnl_get_param() to make everything consistent.
> 
>> +/* Called with rcu_read_lock and BH disabled. */
>> +static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)
>> +{
>> +       struct vport *vport;
>> +       struct vxlanhdr *vxh;
>> +       const struct tnl_mutable_config *mutable;
>> +       struct iphdr *iph;
>> +       struct ovs_key_ipv4_tunnel tun_key;
>> +       int tunnel_type;
>> +       __be64 key;
>> +       u32 tunnel_flags = 0;
>> +
>> +       if (unlikely(!pskb_may_pull(skb, VXLAN_HLEN + ETH_HLEN)))
>> +               goto error;
>> +
>> +       vxh = vxlan_hdr(skb);
>> +       if (unlikely(vxh->vx_flags != htonl(VXLAN_FLAGS) ||
>> +                    vxh->vx_vni & htonl(0xff)))
>> +               goto error;
>> +
>> +       __skb_pull(skb, VXLAN_HLEN);
>> +       skb_postpull_rcsum(skb, skb_transport_header(skb), VXLAN_HLEN + 
>> ETH_HLEN);
>> +
>> +       key = cpu_to_be64(ntohl(vxh->vx_vni) >> 8);
>> +
>> +       tunnel_type = TNL_T_PROTO_VXLAN;
> 
> I'm not sure that this actually needs to be stored in a variable since
> it is only used once.
> 
>> +static int vxlan_socket_init(struct vxlan_port *vxlan_port)
>> +{
>> +       int err;
>> +       struct sockaddr_in sin;
>> +
>> +       err = sock_create(AF_INET, SOCK_DGRAM, 0, 
>> &vxlan_port->vxlan_rcv_socket);
>> +       if (err)
>> +               goto error;
> 
> It doesn't look like this handles namespaces.  I think we need to use
> sock_create_kern(0 and sk_change_net() like in CAPWAP.
> 
>> +static int vxlan_tunnel_setup(struct net *net, const char *linkname,
>> +                            struct nlattr *options)
>> +{
>> +       struct nlattr *a[OVS_TUNNEL_ATTR_MAX + 1];
>> +       int err;
>> +       u16 dst_port;
>> +       struct vxlan_port *vxlan_port;
>> +       struct vxlan_if *vxlan_if;
>> +
>> +       if (!options) {
>> +               err = -EINVAL;
>> +               goto out;
>> +       }
>> +
>> +       err = nla_parse_nested(a, OVS_TUNNEL_ATTR_MAX, options, 
>> vxlan_policy);
>> +       if (err)
>> +               goto out;
> 
> You could use nla_find_nested(), it seems to match what you are
> looking for here.
> 
>> +
>> +       if (a[OVS_TUNNEL_ATTR_DST_PORT])
>> +               dst_port = nla_get_u16(a[OVS_TUNNEL_ATTR_DST_PORT]);
>> +       else
>> +               dst_port = VXLAN_DST_PORT;
>> +
>> +       /* Verify if we already have a socket created for this port */
>> +       vxlan_port = vxlan_port_exists(net, dst_port);
>> +       if (vxlan_port) {
>> +               vxlan_port->count++;
>> +               err = 0;
>> +               goto out;
>> +       }
>> +
>> +       /* Add a new socket for this port */
>> +       vxlan_port = kmalloc(sizeof(struct vxlan_port), GFP_KERNEL);
>> +       if (!vxlan_port) {
>> +               err = -ENOMEM;
>> +               goto out;
>> +       }
>> +       memset (vxlan_port, 0, sizeof(struct vxlan_port));
> 
> You could combine the kmalloc() and memset() using kzalloc() instead.
> 
>> +       vxlan_port->port = dst_port;
>> +       vxlan_port->count++;
> 
> I think this would be clearer if you initialized it to 1 (since I
> think this will always be the first user).
> 
>> +static int vxlan_set_options(struct vport *vport, struct nlattr *options)
>> +{
>> +       int err;
>> +       const char *vname = vport->ops->get_name(vport);
>> +
>> +       err = vxlan_tunnel_setup(ovs_dp_get_net(vport->dp), vname, options);
>> +       if (err)
>> +               goto out;
> 
> If the dest port is changed and the refcount drops to zero, shouldn't
> we remove that socket?  I think the best approach may be to add the
> new socket, call set_options, and then drop the old socket.
> 
>> +void vxlan_tnl_destroy(struct vport *vport)
>> +{
>> +       struct vxlan_if *vxlan_if;
>> +       struct vxlan_port *vxlan_port;
>> +       const char *vname = vport->ops->get_name(vport);
>> +
>> +       vxlan_if = vxlan_if_by_name(ovs_dp_get_net(vport->dp), vname);
>> +       if (!vxlan_if)
>> +               goto out;
> 
> I think we can use tnl_vport_priv() convert the vport to a tnl_vport
> and pull the dst_port out of the mutable directly.  Then we wouldn't
> have to bother with the if_name hash at all.
> 
>> +static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
>> +{
>> +       int err;
>> +
>> +       err = vxlan_tunnel_setup(ovs_dp_get_net(parms->dp), parms->name,
>> +                                               parms->options);
>> +       return ovs_tnl_create(parms, &ovs_vxlan_vport_ops, 
>> &ovs_vxlan_tnl_ops);
> 
> If ovs_tnl_create() fails won't this result in an incorrect refcount?
> 
>> +static int vxlan_init(void)
>> +{
> [...]
>> +       vxlan_ports = kzalloc(VXLAN_SOCK_HASH_BUCKETS * sizeof(struct 
>> hlist_head),
>> +                               GFP_KERNEL);
> 
> We probably could get away with just making this fairly small and then
> statically allocating it.  Realistically, even just a linked list
> would be fine.
> 
>> diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
>> index e7d4b49..2ae5681 100644
>> --- a/include/linux/openvswitch.h
>> +++ b/include/linux/openvswitch.h
>> @@ -186,6 +186,7 @@ enum ovs_vport_type {
>>        OVS_VPORT_TYPE_PATCH = 100, /* virtual tunnel connecting two vports */
>>        OVS_VPORT_TYPE_GRE,      /* GRE tunnel */
>>        OVS_VPORT_TYPE_CAPWAP,   /* CAPWAP tunnel */
>> +       OVS_VPORT_TYPE_VXLAN,    /* VXLAN tunnel */
> 
> Since the plan is to get this upstream, let's put it after FT_GRE.
> That way we can avoid having to do any switchover.
> 
>> diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c
>> index 5171171..3dbb798 100644
>> --- a/lib/netdev-vport.c
>> +++ b/lib/netdev-vport.c
>> @@ -173,6 +173,13 @@ netdev_vport_get_netdev_type(const struct 
>> dpif_linux_vport *vport)
>>     case OVS_VPORT_TYPE_CAPWAP:
>>         return "capwap";
>> 
>> +    case OVS_VPORT_TYPE_VXLAN:
>> +        if (tnl_port_config_from_nlattr(vport->options, vport->options_len,
>> +                                        a)) {
>> +            break;
>> +        }
> 
> We only need to retrieve the attributes in order to check if IPsec is
> set, so we should be able to drop the if block here.
> 
>> @@ -584,6 +591,7 @@ parse_tunnel_config(const char *name, const char *type,
>>                     const struct smap *args, struct ofpbuf *options)
>> {
>>     bool is_gre = false;
>> +    bool is_vxlan = false;
> 
> I think it's probably better to move to more generic names like
> "has_dst_port" rather than directly tying it to the protocol (the
> effect would still be the same though).
> 
>> @@ -835,6 +848,11 @@ unparse_tunnel_config(const char *name OVS_UNUSED, 
>> const char *type OVS_UNUSED,
>>         smap_add_format(args, "tos", "0x%x", tos);
>>     }
>> 
>> +    if (a[OVS_TUNNEL_ATTR_DST_PORT]) {
>> +        ovs_be16 dst_port = nl_attr_get_be16(a[OVS_TUNNEL_ATTR_DST_PORT]);
> 
> I believe that everywhere else the type for dst_port is a u16 instead of be16.
> 
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index c2786a5..65ead8c 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -1393,9 +1412,10 @@
>>         deprecated and will be removed soon.
>>       </column>
>> 
>> -      <group title="Tunnel Options: gre only">
>> +      <group title="Tunnel Options: gre and vxlan only">
>>         <p>
>> -          Only <code>gre</code> interfaces support these options.
>> +          Only <code>gre</code> and <code>vxlan</code> interfaces support 
>> these
>> +          options.
>>         </p>
>>       </group>
> 
> I think we can just delete this empty group entirely now.
> 
>> @@ -1427,11 +1447,19 @@
>>         </column>
>>       </group>
>> 
>> -      <group title="Tunnel Options: ipsec_gre only">
>> +      <group title="Tunnel Options: ipsec_gre and ipsec_vxlan only">
>>         <p>
>> -          Only <code>ipsec_gre</code> interfaces support these options.
>> +          Only <code>ipsec_gre</code> and <code>ipsec_vxlan</code> 
>> interfaces
>> +          support these options.
>>         </p>
>> 
>> +       <p>
>> +         These options are implemented through a separate daemon named
>> +         <code>ovs-monitor-ipsec</code> that so far has only been ported to
>> +         and packaged for Debian (including derivative distributions such as
>> +         Ubuntu).
>> +       </p>
> 
> Since IPsec has been removed, I believe these changes are no longer relevant.

_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Reply via email to