Thanks and sorry that I took so long to get back to you. I thought about the layering a fair amount and am still somewhat bothered by the flow of things. I think if we make a couple small modifications then it will fit more cleanly: * I don't think that we really need ip_tunnel_xmit(). It's primarily there to handle offloads but if we get in the GSO changes before allowing upper layer protocols to send these offloaded packets then we don't have to worry about it. Also, the send_frags() function is there due to CAPWAP's protocol level fragmentation, which we don't have to worry about here. Without these two we can call directly into the GRE data plane code and have it do the complete process without much in the way of code duplication. * Registering to receive packets should probably into the GRE data plane module instead of directly to the IP tunnel code. There's potentially a module dependency problem otherwise - if all you do is register to receive packets then there's nothing that will actually cause the GRE module to get loaded. This is resolved at the moment by the use of the transmit function but it seems a little fragile. We can definitely still use the IP tunnel code and share structures, etc. but we would just add an extra hop.
At this point the upper tunneling interface would look something like this: gre_register() gre_unregister() gre_send() It's obviously protocol specific but I think at this point that it's inevitable and this way at least it is consistent. I'm also aware that it makes the IP tunnel code more of a library than I was proposing - now that I can see the code I can understand what you were saying that it makes more sense this way. On Thu, Nov 22, 2012 at 7:56 AM, Pravin B Shelar <pshe...@nicira.com> wrote: > diff --git a/include/net/gre.h b/include/net/gre.h > index 8266547..dda547a 100644 > --- a/include/net/gre.h > +++ b/include/net/gre.h > @@ -15,4 +16,12 @@ struct gre_protocol { > int gre_add_protocol(const struct gre_protocol *proto, u8 version); > int gre_del_protocol(const struct gre_protocol *proto, u8 version); > > +struct sk_buff *gre_build_header(struct sk_buff *skb, > + const struct tnl_ptk_info *tpi); > +struct gre_base_hdr { > + __be16 flags; > + __be16 protocol; > +}; > +#define GRE_HEADER_SECTION 4 I think these are only used in gre.c, so maybe we can keep the definitions private to that (since ideally no other code will need to have these protocol details). > diff --git a/include/net/ipip.h b/include/net/ipip.h > index 21947cf..a14bde8 100644 > --- a/include/net/ipip.h > +++ b/include/net/ipip.h What do you think about renaming this to ip_tunnel.h or similar? It seems like too specific of a name at this point. > +#define TUNNEL_CSUM __cpu_to_be16(0x8000) > +#define TUNNEL_KEY __cpu_to_be16(0x2000) > +#define TUNNEL_SEQ __cpu_to_be16(0x1000) I don't think that we really want to tie these directly to GRE flags. As we add more protocols it will just get more complicated. > +struct tnl_ptk_info { > + __be16 flags; > + __be16 proto; > + __be32 key; > + __be32 seq; > + int hdr_len; > + __sum16 csum; Do we actually need the hdr_len and csum fields. hdr_len seems like it ties the control plane to the data plane a little too closely. For csum, can't we directly drop packets with invalid checksums when we parse the header? There aren't any current users of packets with invalid checksums and none really come to mind in the future. > +#define HASH_ON_KEY 1 I think this isn't used until later with VXLAN so maybe we should wait until then. > +struct ip_tunnel_ops { > + u32 flags; > + int (*parse_netlink_parms)(struct ip_tunnel *vxlan, > + struct nlattr *data[], > + struct nlattr *tb[], > + struct ip_tunnel_parm *parms); > + int (*get_ioctl_param)(struct net_device *dev, > + struct ifreq *ifr, > + struct ip_tunnel_parm *p); > +}; > + > +#define IPT_HASH_BITS 10 > +#define IPT_HASH_SIZE (1 << IPT_HASH_BITS) > + > +struct ip_tunnel_net { > + struct ip_tunnel __rcu **tunnels; > + struct net_device *fb_tunnel_dev; > + const struct ip_tunnel_ops *ops; > +}; > + > +enum ipt_type { > + IPT_GRE, > + IPT_VXLAN, Probably the VXLAN identifier should come later when VXLAN starts to actually use this infrastructure. > +#define IPT_HASH_BUCKETS 16 > +extern struct hlist_head ipt_proto[IPT_HASH_BUCKETS]; The names for IPT_HASH_BUCKETS and IPT_HASH_SIZE are very similar, we probably should choose something more descriptive. > +static inline void ip_tunnel_setup(struct net_device *dev, int net_id) > +{ > + struct ip_tunnel *tunnel = netdev_priv(dev); > + tunnel->ipt_net_id = net_id; > +} Does this need to be an inline? We could probably just make it consistent. > diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig > index 5a19aeb..4110ff3 100644 > --- a/net/ipv4/Kconfig > +++ b/net/ipv4/Kconfig > @@ -166,6 +166,7 @@ config IP_PNP_RARP > config NET_IPIP > tristate "IP: tunneling" > select INET_TUNNEL > + select NET_IP_TUNNEL IPIP isn't using anything from here yet, right? > diff --git a/net/ipv4/gre.c b/net/ipv4/gre.c > index 5a903dc..b837551 100644 > --- a/net/ipv4/gre.c > +++ b/net/ipv4/gre.c > +struct sk_buff *gre_build_header(struct sk_buff *skb, > + const struct tnl_ptk_info *tpi) > +{ > + struct iphdr *iph = ip_hdr(skb); > + struct gre_base_hdr *greh = (struct gre_base_hdr *)&iph[1]; Can we reverse the order that we build the headers so that GRE comes before the IP header? That way you could just push each incremental piece on, rather than having to do all of the pointer manipulation. > + struct dst_entry *dst = skb_dst(skb); > + > + greh->flags = tpi->flags; > + greh->protocol = tpi->proto; > + if (tpi->flags&(GRE_KEY|GRE_CSUM|GRE_SEQ)) { > + __be32 *ptr = (__be32 *)(((u8 *)greh) + tpi->hdr_len - 4); > + > + if (tpi->flags&GRE_SEQ) { > + *ptr = tpi->seq; > + ptr--; > + } > + if (tpi->flags&GRE_KEY) { > + *ptr = tpi->key; > + ptr--; > + } > + if (tpi->flags&GRE_CSUM) { > + *(__sum16 *)ptr = 0; > + *(__sum16 *)ptr = csum_fold(skb_checksum(skb, > + skb_transport_offset(skb), > + skb->len - > skb_transport_offset(skb), > + 0)); > + } > + } > + skb->local_df = 1; > + __ip_select_ident(ip_hdr(skb), dst, 0); We should just fix the fragmentation code to select an ID if necessary rather than keeping this workaround. > + return skb; > +} > +EXPORT_SYMBOL(gre_build_header); > + > + Extra blank line. > +static int parse_gre_header(struct sk_buff *skb, struct tnl_ptk_info *tpi) > +{ > + struct gre_base_hdr *greh = (struct gre_base_hdr *)skb->data; > + __be32 *options = (__be32 *)(greh + 1); > + > + if (unlikely(greh->flags & (GRE_VERSION | GRE_ROUTING))) > + return -EINVAL; > + > + tpi->flags = greh->flags; > + tpi->proto = greh->protocol; > + > + tpi->hdr_len = GRE_HEADER_SECTION; > + tpi->csum = check_checksum(skb); > + > + if (tpi->csum) > + return -EINVAL; > + > + if (greh->flags & GRE_CSUM) { > + tpi->hdr_len += GRE_HEADER_SECTION; > + options++; > + } Can we incrementally pull each piece off so that we don't have to pass hdr_len around? > +static int ipgre_rcv_v0(struct sk_buff *skb) > +{ > + struct tnl_ptk_info tpi; > + struct ipt_protocol *proto; > + struct hlist_node *n; > + struct hlist_head *head; > + > + if (!pskb_may_pull(skb, 16)) > + goto drop; > + > + if (parse_gre_header(skb, &tpi) < 0) > + goto drop; > + > + head = ipt_hash_bucket(IPT_GRE, 0); > + hlist_for_each_entry_rcu(proto, n, head, node) { > + int ret; > + > + if (proto->type != IPT_GRE || proto->portno != 0) > + continue; > + ret = proto->handler(skb, &tpi); It seems that it is very likely that users of this API will have the same handler for all port numbers (and running different implementation on different ports at the same time seems not that common either). In that case, it would be easier to just pass the port number to the handler and let it do its own lookup. Also, does it make sense to pull the complete hash and lookup block out into ip_tunnel.c? > +static void ipgre_err_v0(struct sk_buff *skb, u32 info) > +{ > + > + /* All the routers (except for Linux) return only > + * 8 bytes of packet payload. It means, that precise relaying of > + * ICMP in the real Internet is absolutely infeasible. > + * > + * Moreover, Cisco "wise men" put GRE key to the third word > + * in GRE header. It makes impossible maintaining even soft > + * state for keyed > + * GRE tunnels with enabled checksum. Tell them "thank you". > + * > + * Well, I wonder, rfc1812 was written by Cisco employee, > + * what the hell these idiots break standards established > + * by themselves??? > + */ > + > + const int type = icmp_hdr(skb)->type; > + const int code = icmp_hdr(skb)->code; > + struct tnl_ptk_info tpi; > + struct hlist_node *n; > + struct hlist_head *head; > + struct ipt_protocol *proto; > + > + if (!pskb_may_pull(skb, sizeof(struct gre_base_hdr) + ETH_HLEN)) > + return; This code is shared by both encapsulated IP and Ethernet so we should be careful not to make any assumptions there. > + parse_gre_header(skb, &tpi); > + > + if (tpi.csum) > + return; I don't think that we can validate the checksum here. It's likely that the ICMP message will only return a portion of the GRE packet so we'll falsely think the checksum is wrong. > + /* If only 8 bytes returned, keyed message will be dropped here */ > + if (tpi.flags & GRE_KEY) { > + if ((tpi.flags & GRE_CSUM) && (tpi.hdr_len < 12)) > + return; > + if (tpi.hdr_len < 8) > + return; > + } > + > + switch (type) { > + default: > + case ICMP_PARAMETERPROB: > + return; > + > + case ICMP_DEST_UNREACH: > + switch (code) { > + case ICMP_SR_FAILED: > + case ICMP_PORT_UNREACH: > + /* Impossible event. */ > + return; > + default: > + /* All others are translated to HOST_UNREACH. > + rfc2003 contains "deep thoughts" about NET_UNREACH, > + I believe they are just ether pollution. --ANK > + */ > + break; > + } > + break; > + case ICMP_TIME_EXCEEDED: > + if (code != ICMP_EXC_TTL) > + return; > + break; > + > + case ICMP_REDIRECT: > + break; > + } > + > + head = ipt_hash_bucket(IPT_GRE, 0); > + hlist_for_each_entry_rcu(proto, n, head, node) { > + if (proto->type != IPT_GRE || proto->portno != 0) > + continue; > + if (proto->err_handler(skb, info, &tpi) <= 0) > + return; > + } I think we can move the dispatch lookup above the type check to avoid imposing a policy on the ultimate handler. That said, can we continue to handle path MTU discovery and ICMP redirects at this level? It doesn't really depend on the tunnel device and seems like a common data plane component. > diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c > index 0d4eecd..829fe3d 100644 > --- a/net/ipv4/ip_gre.c > +++ b/net/ipv4/ip_gre.c > /* Fallback tunnel: no source, no destination, no key, no options */ > > #define HASH_SIZE 16 I think these two lines are unused now. > static int ipgre_net_id __read_mostly; > -static void ipgre_err(struct sk_buff *skb, u32 info) [...] > + int dev_type; [...] > + if (tpi->proto == htons(ETH_P_TEB)) { > + dev_type = ARPHRD_ETHER; > + itn = net_generic(net, ipgre_tap_net_id); > + } else { > + dev_type = ARPHRD_IPGRE; > + itn = net_generic(net, ipgre_net_id); > } It doesn't look to me like dev_type is used anywhere, either here or in ipgre_rcv(). Also, can we use a return code other than 1 for tunnel not found in these two functions? Something symbolic seems better to me. > static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device > *dev) [...] > + err = ip_tunnel_build_iphdr(skb, dev, tiph, gre_hlen, &iph); > + if (err) > + return NETDEV_TX_OK; > + > + tpi.flags = tunnel->parms.o_flags; > + tpi.proto = proto; > + tpi.key = tunnel->parms.o_key; > + tpi.seq = htonl(tunnel->o_seqno); > + tpi.hdr_len = tunnel->hlen; > + if (skb_cow_head(skb, dev->needed_headroom)) { > + dev->stats.tx_dropped++; > dev_kfree_skb(skb); > - skb = new_skb; > - old_iph = ip_hdr(skb); > - } Can we push this down into the encapsulation code? It seems like a common data plane piece. > - if (tunnel->parms.o_flags&(GRE_KEY|GRE_CSUM|GRE_SEQ)) { > - __be32 *ptr = (__be32 *)(((u8 *)iph) + tunnel->hlen - 4); > + u64_stats_update_begin(&tstats->syncp); > + tstats->tx_bytes += tx_len; > + tstats->tx_packets++; > + u64_stats_update_end(&tstats->syncp); > > - if (tunnel->parms.o_flags&GRE_SEQ) { > + if (tunnel->parms.o_flags&GRE_SEQ) > ++tunnel->o_seqno; The current version increments the sequence number before transmission, so it always goes up regardless of the outcome. That seems a little more natural to me. Also, can we push the stats updates to ip_tunnel? They are incremented there everywhere else and it would be good to keep things in a single location to the greatest extent possible. > diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c > new file mode 100644 > index 0000000..e5e2fde > --- /dev/null > +++ b/net/ipv4/ip_tunnel.c > +static void free_linked_skbs(struct sk_buff *skb) > +{ > + while (skb) { > + struct sk_buff *next = skb->next; > + kfree_skb(skb); > + skb = next; > + } > +} > + > + Extra blank line. > +static int send_frags(struct sk_buff *skb, > + int tunnel_hlen) > +{ > + int sent_len; > + > + sent_len = 0; > + while (skb) { > + struct sk_buff *next = skb->next; > + int frag_len = skb->len - tunnel_hlen; > + int err; > + > + skb->next = NULL; > + memset(IPCB(skb), 0, sizeof(*IPCB(skb))); > + > + err = ip_local_out(skb); > + skb = next; > + if (unlikely(net_xmit_eval(err))) > + goto free_frags; > + sent_len += frag_len; > + } > + > + return sent_len; > + > +free_frags: > + /* > + * There's no point in continuing to send fragments once one has been > + * dropped so just free the rest. This may help improve the > congestion > + * that caused the first packet to be dropped. > + */ > + free_linked_skbs(skb); > + return sent_len; > +} > + > + Another double blank line here. > +int __net_init ip_tunnel_init_net(struct net *net, int ipt_net_id, > + struct rtnl_link_ops *ops) > +{ > + struct ip_tunnel_net *itn = net_generic(net, ipt_net_id); > + struct ip_tunnel_parm parms; > + > + itn->tunnels = kzalloc(sizeof(void *) * IPT_HASH_SIZE, GFP_KERNEL); > + if (!itn->tunnels) > + return -ENOMEM; > + > + if (!ops) { Presumably some type of ops needs to be supplied. Should the fallback device be created only if we have ioctl ops? > +/* Can be moved to ip-tunnel */ > +int ip_tunnel_newlink(struct net *src_net, struct net_device *dev, I think the comment here is out of date. > +void ip_tunnel_uninit(struct net_device *dev) > +{ > + struct net *net = dev_net(dev); > + struct ip_tunnel *tunnel = netdev_priv(dev); > + struct ip_tunnel_net *itn; > + > + itn = net_generic(net, tunnel->ipt_net_id); > + /* fb_tunnel_dev will be unregisted in net-exit call. */ > + if (itn->fb_tunnel_dev != dev) > + ip_tunnel_unlink(itn, netdev_priv(dev)); > +} > +EXPORT_SYMBOL(ip_tunnel_uninit); > + > + Double blank line here. > +int ipt_add_protocol(struct ipt_protocol *newp) > +{ > + struct hlist_head *head; > + struct ipt_protocol *ipt = NULL; > + struct hlist_node *n; > + ASSERT_RTNL(); > + > + head = ipt_hash_bucket(newp->type, newp->portno); > + > + hlist_for_each_entry_rcu(ipt, n, head, node) { > + if (ipt->type != newp->type || ipt->portno != newp->portno) > + continue; > + if (ipt->priority > newp->priority) { > + hlist_add_before_rcu(&newp->node, &ipt->node); > + return 0; > + } > + if (ipt->priority == newp->priority) > + return -1; Can you use a symbolic error code here? > +void ipt_del_protocol(struct ipt_protocol *proto) > +{ > + ASSERT_RTNL(); > + > + hlist_del_rcu(&proto->node); > + synchronize_rcu(); I would use synchronize_net() here. > +static int __init ip_tunnel_mod_init(void) > +{ > + pr_info("IP_Tunnel init\n"); I would either make this a little more descriptive or just remove it entirely. Right now it's not very human readable. > + get_random_bytes(&ip_tunnel_salt, sizeof(ip_tunnel_salt)); Do we actually need to hash in a random value? The hash entries aren't controlled by anything on the network. I'm still working on going through the rest of the ioctl/Netlink code but I wanted to get back to you with what I have. _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev