On Tue, Mar 31, 2015 at 5:07 PM, Jesse Gross <je...@nicira.com> wrote: > On Thu, Mar 26, 2015 at 4:41 PM, Pravin B Shelar <pshe...@nicira.com> wrote: >> diff --git a/datapath/datapath.c b/datapath/datapath.c >> index 7f431ed..ea9c6ae 100644 >> --- a/datapath/datapath.c >> +++ b/datapath/datapath.c >> @@ -2192,6 +2192,7 @@ static int __net_init ovs_init_net(struct net *net) >> >> INIT_LIST_HEAD(&ovs_net->dps); >> INIT_WORK(&ovs_net->dp_notify_work, ovs_dp_notify_wq); >> + INIT_LIST_HEAD(&ovs_net->vport_net.stt_sock_list); >> return 0; >> } >> > > In this previous version, this was a little bit more self contained. > Did you run into a problem with that? If not, the other version seems > a little cleaner to me, especially since this won't be upstream and so > the less invasive it is, the better. >
I saw dead lock if I register net-namesapce while adding vport. This due to ABBA locking sequence with ovs-lock and net-mutex. STT port add and net-exit can deadlock this way. vport_net already had pointers for different vport. so I think adding stt socket list is not much different. >> diff --git a/datapath/linux/compat/stt.c b/datapath/linux/compat/stt.c >> new file mode 100644 >> index 0000000..df643de >> --- /dev/null >> +++ b/datapath/linux/compat/stt.c >> +static void copy_skb_metadata(struct sk_buff *to, struct sk_buff *from) >> +{ >> + to->tstamp = from->tstamp; >> + to->priority = from->priority; >> + to->mark = from->mark; >> + to->vlan_tci = from->vlan_tci; >> + skb_copy_secmark(to, from); >> +} > > Is there any other metadata that we might need? What about vlan_proto? > (I think we also need to check/set this when encapsulating and > decapsulating.) > ok. I will look into this. >> +static bool __linearize(struct sk_buff *head, bool ipv4, bool tcp, bool >> csum_partial) > > I might call this need_linearize() or similar, since it doesn't > actually do any linearization. > ok. I changed it to can_segment(). >> +static int skb_list_linearize(struct sk_buff *head, gfp_t gfp_mask) >> +{ >> + struct sk_buff *skb; >> + int tlen = 0; >> + int err; >> + >> + skb = head; >> + while (skb) { >> + tlen += skb->len; >> + skb = skb->next; >> + } >> + >> + err = pskb_expand_head(head, 0, tlen, gfp_mask); >> + if (err) >> + return err; > > I think this includes the length of head in tlen, which is probably > not necessary. > ok. >> +static int __segment_skb(struct sk_buff **headp, bool ipv4, bool tcp, bool >> csum_partial, int l4_offset) >> +{ >> + int err; >> + >> + err = straighten_frag_list(headp); >> + if (unlikely(err)) >> + return err; >> + >> + if (__linearize(*headp, ipv4, tcp, csum_partial)) >> + return skb_list_linearize(*headp, GFP_ATOMIC); > > I don't think this check is in the right place. The conditions for > linearizing only come into play if we can't fully coalesce things and > would otherwise generate a frag_list. But we don't know that before > trying to coalesce. > There is trade-off. If it is large skb we might not able to segment multiple skb depending on protocols and csum_partial. I decided to check for these parameters before so we do not waste cycles coalescing skbs. But I do not have strong opinion on this. It can be easily changed the way you suggested. >> +static struct sk_buff *push_stt_header(struct sk_buff *head, __be64 tun_id, >> + __be16 s_port, __be16 d_port, >> + __be32 saddr, __be32 dst, >> + __be16 h_proto, u8 nw_proto, >> + int dst_mtu) >> +{ >> + struct sk_buff *skb; >> + >> + if (skb_shinfo(head)->frag_list) { >> + bool ipv4 = (h_proto == htons(ETH_P_IP)); >> + bool tcp = (nw_proto == IPPROTO_TCP); >> + bool csum_partial = (head->ip_summed == CHECKSUM_PARTIAL); >> + int l4_offset = skb_transport_offset(head); >> + >> + if (unlikely(__segment_skb(&head, ipv4, tcp, >> + csum_partial, l4_offset))) { >> + goto error; > > My guess is that not all of the conditions in __linearize apply to the > transmit case (such as the one about offloading). It's worth checking. > ok. > At a higher level, is there a reason why STT is special in this regard > as far as inserting a header into a packet that has a frag_list? Don't > other tunnels need to deal with it? > frag_list is handled because not all drivers can handle skbs with frag_list. This can cause performance issues. Older kernel might not be able to handle skb geometry that STT generates. >> +static bool stt_can_offload(struct sk_buff *skb, __be16 h_proto, u8 >> nw_proto) >> +{ > > We probably should have a more direct check for GSO types in here. We > specifically check for SKB_GSO_TCP_ECN but a whitelist is better than > a blacklist. I think the tunnel offloads are already a problem is > somebody tries to do double encapsulation. > ok. >> +int stt_xmit_skb(struct sk_buff *skb, struct rtable *rt, >> + __be32 src, __be32 dst, __u8 tos, >> + __u8 ttl, __be16 df, __be16 src_port, __be16 dst_port, >> + __be64 tun_id) >> +{ > [...] >> + if (!stt_can_offload(skb, inner_h_proto, inner_nw_proto)) { >> + struct sk_buff *nskb; >> + >> + nskb = handle_offloads(skb); >> + if (IS_ERR(nskb)) { >> + ret = PTR_ERR(nskb); >> + goto err_free_rt; >> + } >> + skb = nskb; > > We might have an issue with MPLS here - it wasn't available yet in 3.5 > so skb_gso_segment() will choke on it (we have backports but they > don't run in direct calls to skb_gso_segment()). > Good catch. I think we have similar problem handling upcalls with MPLS GSO packets. > vlans should be OK by this kernel version, although we'll need to push > vlans with non-standard EtherTypes into the packet. > >> +static void stt_rcv(struct stt_sock *stt_sock, struct sk_buff *skb) >> +{ > [...] >> + if (unlikely(stt_hdr(skb)->version != 0)) >> + goto drop; > > Does this need to do a pskb_may_pull() first? > ok. >> + err = iptunnel_pull_header(skb, >> + sizeof(struct stthdr) + STT_ETH_PAD, >> + htons(ETH_P_TEB)); >> + if (unlikely(err)) >> + goto drop; >> + >> + if (unlikely(!set_offloads(skb))) >> + goto drop; >> > > I don't think the above operations are legal to do on an skb that > hasn't been merged yet. If the header spans STT segments, they will > fail even if it shouldn't. > ok, I will move coalescing above this operation. >> diff --git a/datapath/vport.c b/datapath/vport.c >> index 5a7067b..697ee6a 100644 >> --- a/datapath/vport.c >> +++ b/datapath/vport.c >> @@ -30,6 +30,7 @@ >> #include <linux/compat.h> >> #include <linux/version.h> >> #include <net/net_namespace.h> >> +#include <net/stt.h> > > I think STT internals probably don't need to leak into vport through > the header file. > ok. >> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml >> index 81e8b3f..d6cb443 100644 >> --- a/vswitchd/vswitch.xml >> +++ b/vswitchd/vswitch.xml >> @@ -1863,6 +1863,17 @@ >> </p> >> </dd> >> >> + <dt><code>stt</code></dt> >> + <dd> >> + The Stateless TCP Tunnel (STT) protocol encapsulates traffic in >> + IPv4/TCP packets. All traffic uses a default destination port >> of 7471. >> + The STT protocol does not engage in the usual TCP 3-way >> handshake, >> + so it will have difficulty traversing stateful firewalls. >> Therefore >> + it is better suited for hypervisor to hypervisor tunneling >> within >> + data center. >> + STT is only available in kernel datapath on kernel 3.5 or >> newer. >> + </dd> > > Can you make the documentation a little more descriptive of the protocol? ok. _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev