On Tue, Mar 31, 2015 at 5:07 PM, Jesse Gross <je...@nicira.com> wrote:
> On Thu, Mar 26, 2015 at 4:41 PM, Pravin B Shelar <pshe...@nicira.com> wrote:
>> diff --git a/datapath/datapath.c b/datapath/datapath.c
>> index 7f431ed..ea9c6ae 100644
>> --- a/datapath/datapath.c
>> +++ b/datapath/datapath.c
>> @@ -2192,6 +2192,7 @@ static int __net_init ovs_init_net(struct net *net)
>>
>>         INIT_LIST_HEAD(&ovs_net->dps);
>>         INIT_WORK(&ovs_net->dp_notify_work, ovs_dp_notify_wq);
>> +       INIT_LIST_HEAD(&ovs_net->vport_net.stt_sock_list);
>>         return 0;
>>  }
>>
>
> In this previous version, this was a little bit more self contained.
> Did you run into a problem with that? If not, the other version seems
> a little cleaner to me, especially since this won't be upstream and so
> the less invasive it is, the better.
>

I saw dead lock if I register net-namesapce while adding vport. This
due to ABBA locking sequence with ovs-lock and net-mutex. STT port add
and net-exit can deadlock this way.
vport_net already had pointers for different vport. so I think adding
stt socket list is not much different.

>> diff --git a/datapath/linux/compat/stt.c b/datapath/linux/compat/stt.c
>> new file mode 100644
>> index 0000000..df643de
>> --- /dev/null
>> +++ b/datapath/linux/compat/stt.c
>> +static void copy_skb_metadata(struct sk_buff *to, struct sk_buff *from)
>> +{
>> +       to->tstamp = from->tstamp;
>> +       to->priority = from->priority;
>> +       to->mark = from->mark;
>> +       to->vlan_tci = from->vlan_tci;
>> +       skb_copy_secmark(to, from);
>> +}
>
> Is there any other metadata that we might need? What about vlan_proto?
> (I think we also need to check/set this when encapsulating and
> decapsulating.)
>
ok. I will look into this.

>> +static bool __linearize(struct sk_buff *head, bool ipv4, bool tcp, bool 
>> csum_partial)
>
> I might call this need_linearize() or similar, since it doesn't
> actually do any linearization.
>
ok. I changed it to can_segment().

>> +static int skb_list_linearize(struct sk_buff *head, gfp_t gfp_mask)
>> +{
>> +       struct sk_buff *skb;
>> +       int tlen = 0;
>> +       int err;
>> +
>> +       skb = head;
>> +       while (skb) {
>> +               tlen += skb->len;
>> +               skb = skb->next;
>> +       }
>> +
>> +       err = pskb_expand_head(head, 0, tlen, gfp_mask);
>> +       if (err)
>> +               return err;
>
> I think this includes the length of head in tlen, which is probably
> not necessary.
>
ok.

>> +static int __segment_skb(struct sk_buff **headp, bool ipv4, bool tcp, bool 
>> csum_partial, int l4_offset)
>> +{
>> +       int err;
>> +
>> +       err = straighten_frag_list(headp);
>> +       if (unlikely(err))
>> +               return err;
>> +
>> +       if (__linearize(*headp, ipv4, tcp, csum_partial))
>> +               return skb_list_linearize(*headp, GFP_ATOMIC);
>
> I don't think this check is in the right place. The conditions for
> linearizing only come into play if we can't fully coalesce things and
> would otherwise generate a frag_list. But we don't know that before
> trying to coalesce.
>

There is trade-off. If it is large skb we might not able to segment
multiple skb depending on protocols and csum_partial. I decided to
check for these parameters before so we do not waste cycles coalescing
skbs. But I do not have strong opinion on this. It can be easily
changed the way you suggested.

>> +static struct sk_buff *push_stt_header(struct sk_buff *head, __be64 tun_id,
>> +                                      __be16 s_port, __be16 d_port,
>> +                                      __be32 saddr, __be32 dst,
>> +                                      __be16 h_proto, u8 nw_proto,
>> +                                      int dst_mtu)
>> +{
>> +       struct sk_buff *skb;
>> +
>> +       if (skb_shinfo(head)->frag_list) {
>> +               bool ipv4 = (h_proto == htons(ETH_P_IP));
>> +               bool tcp = (nw_proto == IPPROTO_TCP);
>> +               bool csum_partial = (head->ip_summed == CHECKSUM_PARTIAL);
>> +               int l4_offset = skb_transport_offset(head);
>> +
>> +               if (unlikely(__segment_skb(&head, ipv4, tcp,
>> +                                          csum_partial, l4_offset))) {
>> +                       goto error;
>
> My guess is that not all of the conditions in __linearize apply to the
> transmit case (such as the one about offloading). It's worth checking.
>
ok.

> At a higher level, is there a reason why STT is special in this regard
> as far as inserting a header into a packet that has a frag_list? Don't
> other tunnels need to deal with it?
>
frag_list is handled because not all drivers can handle skbs with
frag_list. This can cause performance issues. Older kernel might not
be able to handle skb geometry that STT generates.

>> +static bool stt_can_offload(struct sk_buff *skb, __be16 h_proto, u8 
>> nw_proto)
>> +{
>
> We probably should have a more direct check for GSO types in here. We
> specifically check for SKB_GSO_TCP_ECN but a whitelist is better than
> a blacklist. I think the tunnel offloads are already a problem is
> somebody tries to do double encapsulation.
>
ok.

>> +int stt_xmit_skb(struct sk_buff *skb, struct rtable *rt,
>> +                __be32 src, __be32 dst, __u8 tos,
>> +                __u8 ttl, __be16 df, __be16 src_port, __be16 dst_port,
>> +                __be64 tun_id)
>> +{
> [...]
>> +       if (!stt_can_offload(skb, inner_h_proto, inner_nw_proto)) {
>> +               struct sk_buff *nskb;
>> +
>> +               nskb = handle_offloads(skb);
>> +               if (IS_ERR(nskb)) {
>> +                       ret = PTR_ERR(nskb);
>> +                       goto err_free_rt;
>> +               }
>> +               skb = nskb;
>
> We might have an issue with MPLS here - it wasn't available yet in 3.5
> so skb_gso_segment() will choke on it (we have backports but they
> don't run in direct calls to skb_gso_segment()).
>

Good catch. I think we have similar problem handling upcalls with MPLS
GSO packets.

> vlans should be OK by this kernel version, although we'll need to push
> vlans with non-standard EtherTypes into the packet.
>
>> +static void stt_rcv(struct stt_sock *stt_sock, struct sk_buff *skb)
>> +{
> [...]
>> +       if (unlikely(stt_hdr(skb)->version != 0))
>> +               goto drop;
>
> Does this need to do a pskb_may_pull() first?
>
ok.

>> +       err = iptunnel_pull_header(skb,
>> +                                  sizeof(struct stthdr) + STT_ETH_PAD,
>> +                                  htons(ETH_P_TEB));
>> +       if (unlikely(err))
>> +               goto drop;
>> +
>> +       if (unlikely(!set_offloads(skb)))
>> +               goto drop;
>>
>
> I don't think the above operations are legal to do on an skb that
> hasn't been merged yet. If the header spans STT segments, they will
> fail even if it shouldn't.
>
ok, I will move coalescing above this operation.

>> diff --git a/datapath/vport.c b/datapath/vport.c
>> index 5a7067b..697ee6a 100644
>> --- a/datapath/vport.c
>> +++ b/datapath/vport.c
>> @@ -30,6 +30,7 @@
>>  #include <linux/compat.h>
>>  #include <linux/version.h>
>>  #include <net/net_namespace.h>
>> +#include <net/stt.h>
>
> I think STT internals probably don't need to leak into vport through
> the header file.
>
ok.

>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index 81e8b3f..d6cb443 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -1863,6 +1863,17 @@
>>              </p>
>>            </dd>
>>
>> +          <dt><code>stt</code></dt>
>> +          <dd>
>> +             The Stateless TCP Tunnel (STT) protocol encapsulates traffic in
>> +             IPv4/TCP packets.  All traffic uses a default destination port 
>> of 7471.
>> +             The STT protocol does not engage in the usual TCP 3-way 
>> handshake,
>> +             so it will have difficulty traversing stateful firewalls. 
>> Therefore
>> +             it is better suited for hypervisor to hypervisor tunneling 
>> within
>> +             data center.
>> +             STT is only available in kernel datapath on kernel 3.5 or 
>> newer.
>> +          </dd>
>
> Can you make the documentation a little more descriptive of the protocol?
ok.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Reply via email to