> -----Original Message-----
> From: Joe Stringer [mailto:j...@ovn.org]
> Sent: Tuesday, May 24, 2016 7:26 PM
> To: Fischetti, Antonio <antonio.fische...@intel.com>
> Cc: Daniele Di Proietto <diproiet...@vmware.com>; dev@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v3 04/16] conntrack: New userspace
> connection tracker.
> 
> On 24 May 2016 at 07:19, Fischetti, Antonio
> <antonio.fische...@intel.com> wrote:
> > Hi Daniele, just a comment below.
> > Apart from that, it looks good to me, thanks.
> >
> > Acked-by: Antonio Fischetti <antonio.fische...@intel.com>
> >
> >> -----Original Message-----
> >> From: dev [mailto:dev-boun...@openvswitch.org] On Behalf Of
> Daniele
> >> Di Proietto
> >> Sent: Tuesday, May 17, 2016 1:56 AM
> >> To: dev@openvswitch.org
> >> Subject: [ovs-dev] [PATCH v3 04/16] conntrack: New userspace
> >> connection tracker.
> >>
> >> This commit adds the conntrack module.
> >>
> >> It is a connection tracker that resides entirely in userspace.
> Its
> >> primary user will be the dpif-netdev datapath.
> >>
> >> The module's main goal is to provide conntrack_execute(), which
> offers
> >> a
> >> convenient interface to implement the datapath ct() action.
> >>
> >> The conntrack module uses two submodules to deal with the l4
> protocol
> >> details (conntrack-other for UDP and ICMP, conntrack-tcp for TCP).
> >>
> >> The conntrack-tcp submodule implementation is adapted from
> FreeBSD's
> >> pf
> >> subsystem, therefore it's BSD licensed.  It has been slightly
> altered
> >> to
> >> match the OVS coding style and to allow the pickup of already
> >> established connections.
> >>
> >> Signed-off-by: Daniele Di Proietto <diproiet...@vmware.com>
> >> ---
> >>  COPYING                     |   1 +
> >>  debian/copyright.in         |   4 +
> >>  include/openvswitch/types.h |   4 +
> >>  lib/automake.mk             |   5 +
> >>  lib/conntrack-other.c       |  85 +++++
> >>  lib/conntrack-private.h     |  88 +++++
> >>  lib/conntrack-tcp.c         | 463 +++++++++++++++++++++++
> >>  lib/conntrack.c             | 883
> >> ++++++++++++++++++++++++++++++++++++++++++++
> >>  lib/conntrack.h             | 151 ++++++++
> >>  lib/util.h                  |   9 +
> >>  10 files changed, 1693 insertions(+)
> >>  create mode 100644 lib/conntrack-other.c
> >>  create mode 100644 lib/conntrack-private.h
> >>  create mode 100644 lib/conntrack-tcp.c
> >>  create mode 100644 lib/conntrack.c
> >>  create mode 100644 lib/conntrack.h
> >>
> >> diff --git a/COPYING b/COPYING
> >> index 308e3ea..afb98b9 100644
> >> --- a/COPYING
> >> +++ b/COPYING
> >> @@ -25,6 +25,7 @@ License, version 2.
> >>  The following files are licensed under the 2-clause BSD license.
> >>      include/windows/getopt.h
> >>      lib/getopt_long.c
> >> +    lib/conntrack-tcp.c
> >>
> >>  The following files are licensed under the 3-clause BSD-license
> >>      include/windows/netinet/icmp6.h
> >> diff --git a/debian/copyright.in b/debian/copyright.in
> >> index 57d007a..a15f4dd 100644
> >> --- a/debian/copyright.in
> >> +++ b/debian/copyright.in
> >> @@ -21,6 +21,9 @@ Upstream Copyright Holders:
> >>       Copyright (c) 2014 Michael Chapman
> >>       Copyright (c) 2014 WindRiver, Inc.
> >>       Copyright (c) 2014 Avaya, Inc.
> >> +     Copyright (c) 2001 Daniel Hartmeier
> >> +     Copyright (c) 2002 - 2008 Henning Brauer
> >> +     Copyright (c) 2012 Gleb Smirnoff <gleb...@freebsd.org>
> >>
> >>  License:
> >>
> >> @@ -90,6 +93,7 @@ License:
> >>       lib/getopt_long.c
> >>       include/windows/getopt.h
> >>       datapath-windows/ovsext/Conntrack-tcp.c
> >> +     lib/conntrack-tcp.c
> >>
> >>  * The following files are licensed under the 3-clause BSD-license
> >>
> >> diff --git a/include/openvswitch/types.h
> >> b/include/openvswitch/types.h
> >> index 5f3347d..d7e94a6 100644
> >> --- a/include/openvswitch/types.h
> >> +++ b/include/openvswitch/types.h
> >> @@ -107,6 +107,10 @@ static const ovs_u128 OVS_U128_MAX = { {
> >> UINT32_MAX, UINT32_MAX,
> >>                                           UINT32_MAX, UINT32_MAX }
> };
> >>  static const ovs_be128 OVS_BE128_MAX OVS_UNUSED = { {
> OVS_BE32_MAX,
> >> OVS_BE32_MAX,
> >>                                             OVS_BE32_MAX,
> >> OVS_BE32_MAX } };
> >> +static const ovs_u128 OVS_U128_MIN OVS_UNUSED = { {0, 0, 0, 0} };
> >> +static const ovs_be128 OVS_BE128_MIN OVS_UNUSED = { {0, 0, 0, 0}
> };
> >> +
> >> +#define OVS_U128_ZERO OVS_U128_MIN
> >>
> >>  /* A 64-bit value, in network byte order, that is only aligned on
> a
> >> 32-bit
> >>   * boundary. */
> >> diff --git a/lib/automake.mk b/lib/automake.mk
> >> index affbb5c..df8b07d 100644
> >> --- a/lib/automake.mk
> >> +++ b/lib/automake.mk
> >> @@ -47,6 +47,11 @@ lib_libopenvswitch_la_SOURCES = \
> >>       lib/compiler.h \
> >>       lib/connectivity.c \
> >>       lib/connectivity.h \
> >> +     lib/conntrack-private.h \
> >> +     lib/conntrack-tcp.c \
> >> +     lib/conntrack-other.c \
> >> +     lib/conntrack.c \
> >> +     lib/conntrack.h \
> >>       lib/coverage.c \
> >>       lib/coverage.h \
> >>       lib/crc32c.c \
> >> diff --git a/lib/conntrack-other.c b/lib/conntrack-other.c
> >> new file mode 100644
> >> index 0000000..295cb2c
> >> --- /dev/null
> >> +++ b/lib/conntrack-other.c
> >> @@ -0,0 +1,85 @@
> >> +/*
> >> + * Copyright (c) 2015, 2016 Nicira, Inc.
> >> + *
> >> + * Licensed under the Apache License, Version 2.0 (the
> "License");
> >> + * you may not use this file except in compliance with the
> License.
> >> + * You may obtain a copy of the License at:
> >> + *
> >> + *     http://www.apache.org/licenses/LICENSE-2.0
> >> + *
> >> + * Unless required by applicable law or agreed to in writing,
> >> software
> >> + * distributed under the License is distributed on an "AS IS"
> BASIS,
> >> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
> or
> >> implied.
> >> + * See the License for the specific language governing
> permissions
> >> and
> >> + * limitations under the License.
> >> + */
> >> +
> >> +#include <config.h>
> >> +
> >> +#include "conntrack-private.h"
> >> +#include "dp-packet.h"
> >> +
> >> +enum other_state {
> >> +    OTHERS_FIRST,
> >> +    OTHERS_MULTIPLE,
> >> +    OTHERS_BIDIR,
> >> +};
> >> +
> >> +struct conn_other {
> >> +    struct conn up;
> >> +    enum other_state state;
> >> +};
> >> +
> >> +static const enum ct_timeout other_timeouts[] = {
> >> +    [OTHERS_FIRST] = CT_TM_OTHER_FIRST,
> >> +    [OTHERS_MULTIPLE] = CT_TM_OTHER_MULTIPLE,
> >> +    [OTHERS_BIDIR] = CT_TM_OTHER_BIDIR,
> >> +};
> >> +
> >> +static struct conn_other *
> >> +conn_other_cast(const struct conn *conn)
> >> +{
> >> +    return CONTAINER_OF(conn, struct conn_other, up);
> >> +}
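
For anyone new to the "up" member idiom used by conn_other_cast(), here is a
minimal standalone sketch of a container-of cast. SK_CONTAINER_OF and the
sk_* structs are hypothetical names for illustration only; OVS has its own
CONTAINER_OF in lib/util.h, which I have not reproduced here.

```c
#include <stddef.h>

/* Hypothetical "base" object embedded in a larger, protocol-specific one. */
struct sk_base {
    int id;
};

struct sk_derived {
    int extra;
    struct sk_base up;      /* Deliberately not the first member. */
};

/* Given a pointer to MEMBER inside an object of TYPE, returns a pointer to
 * the enclosing object by subtracting the member's offset. */
#define SK_CONTAINER_OF(PTR, TYPE, MEMBER) \
    ((TYPE *) ((char *) (PTR) - offsetof(TYPE, MEMBER)))

static struct sk_derived *
sk_derived_cast(struct sk_base *base)
{
    return SK_CONTAINER_OF(base, struct sk_derived, up);
}
```

The generic code only ever sees the embedded base struct, and each protocol
module recovers its private state with a cast like the one above.
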
> >> +
> >> +static enum ct_update_res
> >> +other_conn_update(struct conn *conn_, struct dp_packet *pkt
> >> OVS_UNUSED,
> >> +                  bool reply, long long now)
> >> +{
> >> +    struct conn_other *conn = conn_other_cast(conn_);
> >> +
> >> +    if (reply && conn->state != OTHERS_BIDIR) {
> >> +        conn->state = OTHERS_BIDIR;
> >> +    } else if (conn->state == OTHERS_FIRST) {
> >> +        conn->state = OTHERS_MULTIPLE;
> >> +    }
> >> +
> >> +    update_expiration(conn_, other_timeouts[conn->state], now);
> >> +
> >> +    return CT_UPDATE_VALID;
> >> +}
> >> +
> >> +static bool
> >> +other_valid_new(struct dp_packet *pkt OVS_UNUSED)
> >> +{
> >> +    return true;
> >> +}
> >> +
> >> +static struct conn *
> >> +other_new_conn(struct dp_packet *pkt OVS_UNUSED, long long now)
> >> +{
> >> +    struct conn_other *conn;
> >> +
> >> +    conn = xzalloc(sizeof *conn);
> >> +    conn->state = OTHERS_FIRST;
> >> +
> >> +    update_expiration(&conn->up, other_timeouts[conn->state],
> now);
> >> +
> >> +    return &conn->up;
> >> +}
> >> +
> >> +struct ct_l4_proto ct_proto_other = {
> >> +    .new_conn = other_new_conn,
> >> +    .valid_new = other_valid_new,
> >> +    .conn_update = other_conn_update,
> >> +};
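
To make the UDP/ICMP transitions above explicit, here is a small standalone
sketch of the same three-state logic. The sk_* names are made up for this
example and are not part of the patch; timeouts are left out on purpose.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for OTHERS_FIRST, OTHERS_MULTIPLE, OTHERS_BIDIR. */
enum sk_state {
    SK_FIRST,       /* One packet seen, in the original direction. */
    SK_MULTIPLE,    /* Several packets seen, original direction only. */
    SK_BIDIR,       /* At least one reply has been seen. */
};

/* Same transitions as other_conn_update(): a reply moves the connection to
 * SK_BIDIR; otherwise a further packet moves SK_FIRST to SK_MULTIPLE.
 * Once in SK_BIDIR the state never changes again. */
static enum sk_state
sk_update(enum sk_state state, bool reply)
{
    if (reply && state != SK_BIDIR) {
        return SK_BIDIR;
    } else if (state == SK_FIRST) {
        return SK_MULTIPLE;
    }
    return state;
}
```

So the state only ever moves forward, and the timeout chosen for the
connection follows whichever state it has reached.
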
> >> diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
> >> new file mode 100644
> >> index 0000000..d3e0099
> >> --- /dev/null
> >> +++ b/lib/conntrack-private.h
> >> @@ -0,0 +1,88 @@
> >> +/*
> >> + * Copyright (c) 2015, 2016 Nicira, Inc.
> >> + *
> >> + * Licensed under the Apache License, Version 2.0 (the
> "License");
> >> + * you may not use this file except in compliance with the
> License.
> >> + * You may obtain a copy of the License at:
> >> + *
> >> + *     http://www.apache.org/licenses/LICENSE-2.0
> >> + *
> >> + * Unless required by applicable law or agreed to in writing,
> >> software
> >> + * distributed under the License is distributed on an "AS IS"
> BASIS,
> >> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
> or
> >> implied.
> >> + * See the License for the specific language governing
> permissions
> >> and
> >> + * limitations under the License.
> >> + */
> >> +
> >> +#ifndef CONNTRACK_PRIVATE_H
> >> +#define CONNTRACK_PRIVATE_H 1
> >> +
> >> +#include <sys/types.h>
> >> +#include <netinet/in.h>
> >> +#include <netinet/ip6.h>
> >> +
> >> +#include "conntrack.h"
> >> +#include "hmap.h"
> >> +#include "openvswitch/list.h"
> >> +#include "openvswitch/types.h"
> >> +#include "packets.h"
> >> +#include "unaligned.h"
> >> +
> >> +struct ct_addr {
> >> +    union {
> >> +        ovs_16aligned_be32 ipv4;
> >> +        union ovs_16aligned_in6_addr ipv6;
> >> +        ovs_be32 ipv4_aligned;
> >> +        struct in6_addr ipv6_aligned;
> >> +    };
> >> +};
> >> +
> >> +struct ct_endpoint {
> >> +    struct ct_addr addr;
> >> +    ovs_be16 port;
> >> +};
> >> +
> >> +struct conn_key {
> >> +    struct ct_endpoint src;
> >> +    struct ct_endpoint dst;
> >> +
> >> +    ovs_be16 dl_type;
> >> +    uint8_t nw_proto;
> >> +    uint16_t zone;
> >> +};
> >> +
> >> +struct conn {
> >> +    struct conn_key key;
> >> +    struct conn_key rev_key;
> >> +    long long expiration;
> >> +    struct ovs_list exp_node;
> >> +    struct hmap_node node;
> >> +    uint32_t mark;
> >> +    ovs_u128 label;
> >> +};
> >> +
> >> +enum ct_update_res {
> >> +    CT_UPDATE_INVALID,
> >> +    CT_UPDATE_VALID,
> >> +    CT_UPDATE_NEW,
> >> +};
> >> +
> >> +struct ct_l4_proto {
> >> +    struct conn *(*new_conn)(struct dp_packet *pkt, long long
> now);
> >> +    bool (*valid_new)(struct dp_packet *pkt);
> >> +    enum ct_update_res (*conn_update)(struct conn *conn, struct
> >> dp_packet *pkt,
> >> +                                      bool reply, long long now);
> >> +};
> >> +
> >> +extern struct ct_l4_proto ct_proto_tcp;
> >> +extern struct ct_l4_proto ct_proto_other;
> >> +
> >> +extern long long ct_timeout_val[];
> >> +
> >> +static inline void
> >> +update_expiration(struct conn *conn, enum ct_timeout tm, long
> long
> >> now)
> >> +{
> >> +    conn->expiration = now + ct_timeout_val[tm];
> >> +}
> >> +
> >> +#endif /* conntrack-private.h */
> >> diff --git a/lib/conntrack-tcp.c b/lib/conntrack-tcp.c
> >> new file mode 100644
> >> index 0000000..4d3d9c3
> >> --- /dev/null
> >> +++ b/lib/conntrack-tcp.c
> >> @@ -0,0 +1,463 @@
> >> +/*-
> >> + * Copyright (c) 2001 Daniel Hartmeier
> >> + * Copyright (c) 2002 - 2008 Henning Brauer
> >> + * Copyright (c) 2012 Gleb Smirnoff <gleb...@freebsd.org>
> >> + * Copyright (c) 2015, 2016 Nicira, Inc.
> >> + * All rights reserved.
> >> + *
> >> + * Redistribution and use in source and binary forms, with or
> >> without
> >> + * modification, are permitted provided that the following
> >> conditions
> >> + * are met:
> >> + *
> >> + *    - Redistributions of source code must retain the above
> >> copyright
> >> + *      notice, this list of conditions and the following
> >> disclaimer.
> >> + *    - Redistributions in binary form must reproduce the above
> >> + *      copyright notice, this list of conditions and the
> following
> >> + *      disclaimer in the documentation and/or other materials
> >> provided
> >> + *      with the distribution.
> >> + *
> >> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> >> CONTRIBUTORS
> >> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT
> NOT
> >> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
> FITNESS
> >> + * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> >> + * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
> >> INDIRECT,
> >> + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
> >> (INCLUDING,
> >> + * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
> SERVICES;
> >> + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> HOWEVER
> >> + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> >> STRICT
> >> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
> IN
> >> + * ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
> THE
> >> + * POSSIBILITY OF SUCH DAMAGE.
> >> + *
> >> + * Effort sponsored in part by the Defense Advanced Research
> >> Projects
> >> + * Agency (DARPA) and Air Force Research Laboratory, Air Force
> >> + * Materiel Command, USAF, under agreement number F30602-01-2-0537.
> >> + *
> >> + *      $OpenBSD: pf.c,v 1.634 2009/02/27 12:37:45 henning Exp $
> >> + */
> >> +
> >> +#include <config.h>
> >> +
> >> +#include "conntrack-private.h"
> >> +#include "ct-dpif.h"
> >> +#include "dp-packet.h"
> >> +#include "util.h"
> >> +
> >> +struct tcp_peer {
> >> +    enum ct_dpif_tcp_state state;
> >> +    uint32_t               seqlo;          /* Max sequence number
> >> sent     */
> >> +    uint32_t               seqhi;          /* Max the other end
> ACKd
> >> + win */
> >> +    uint16_t               max_win;        /* largest window (pre
> >> scaling) */
> >> +    uint8_t                wscale;         /* window scaling
> factor
> >> */
> >> +};
> >> +
> >> +struct conn_tcp {
> >> +    struct conn up;
> >> +    struct tcp_peer peer[2];
> >> +};
> >> +
> >> +enum {
> >> +    TCPOPT_EOL,
> >> +    TCPOPT_NOP,
> >> +    TCPOPT_WINDOW = 3,
> >> +};
> >> +
> >> +/* TCP sequence numbers are 32 bit integers operated
> >> + * on with modular arithmetic.  These macros can be
> >> + * used to compare such integers. */
> >> +#define SEQ_LT(a,b)     INT_MOD_LT(a, b)
> >> +#define SEQ_LEQ(a,b)    INT_MOD_LEQ(a, b)
> >> +#define SEQ_GT(a,b)     INT_MOD_GT(a, b)
> >> +#define SEQ_GEQ(a,b)    INT_MOD_GEQ(a, b)
> >> +
> >> +#define SEQ_MIN(a, b)   INT_MOD_MIN(a, b)
> >> +#define SEQ_MAX(a, b)   INT_MOD_MAX(a, b)
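
As a worked example of the modular comparison described in the comment
above, here is a standalone sketch. seq_lt() and seq_leq() are hypothetical
helpers written for this mail; I am not claiming the INT_MOD_* macros in
lib/util.h are implemented this way, only that they should give the same
answers for 32-bit sequence numbers.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if 'a' comes before 'b' in 32-bit sequence space.  The difference
 * (a - b) is taken modulo 2^32; 'a' precedes 'b' when that difference lands
 * in the upper half of the range, i.e. it would be negative if read as a
 * signed 32-bit value. */
static bool
seq_lt(uint32_t a, uint32_t b)
{
    return (uint32_t) (a - b) >= UINT32_C(0x80000000);
}

/* True if 'a' is equal to 'b' or comes before it. */
static bool
seq_leq(uint32_t a, uint32_t b)
{
    return a == b || seq_lt(a, b);
}
```

With a plain '<', a sequence number just below 2^32 would compare greater
than one that has already wrapped to a small value; seq_lt(0xfffffff0, 0x10)
is true here because the two are only 0x20 apart across the wrap.
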
> >> +
> >> +static struct conn_tcp*
> >> +conn_tcp_cast(const struct conn* conn)
> >> +{
> >> +    return CONTAINER_OF(conn, struct conn_tcp, up);
> >> +}
> >> +
> >> +/* pf does this in pf_normalize_tcp(), and it is called only
> if
> >> scrub
> >> + * is enabled.  We're not scrubbing, but this check seems
> >> reasonable.  */
> >> +static bool
> >> +tcp_invalid_flags(uint16_t flags)
> >> +{
> >> +
> >> +    if (flags & TCP_SYN) {
> >> +        if (flags & TCP_RST || flags & TCP_FIN) {
> >> +            return true;
> >> +        }
> >> +    } else {
> >> +        /* Illegal packet */
> >> +        if (!(flags & (TCP_ACK|TCP_RST))) {
> >> +            return true;
> >> +        }
> >> +    }
> >> +
> >> +    if (!(flags & TCP_ACK)) {
> >> +        /* These flags are only valid if ACK is set */
> >> +        if ((flags & TCP_FIN) || (flags & TCP_PSH) || (flags &
> >> TCP_URG)) {
> >> +            return true;
> >> +        }
> >> +    }
> >> +
> >> +    return false;
> >> +}
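
To check my reading of tcp_invalid_flags(), I wrote the same rules as a
standalone function with a few concrete flag combinations. The SK_* values
are the standard TCP header flag bits; the SK_ prefix and the function name
are mine, chosen to avoid clashing with the real definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard TCP header flag bits, renamed for this sketch. */
#define SK_FIN 0x01
#define SK_SYN 0x02
#define SK_RST 0x04
#define SK_PSH 0x08
#define SK_ACK 0x10
#define SK_URG 0x20

/* Returns true for flag combinations the tracker should reject. */
static bool
sk_invalid_flags(uint16_t flags)
{
    if (flags & SK_SYN) {
        /* SYN may not be combined with RST or FIN. */
        if (flags & (SK_RST | SK_FIN)) {
            return true;
        }
    } else if (!(flags & (SK_ACK | SK_RST))) {
        /* Without SYN, at least one of ACK or RST is required. */
        return true;
    }

    /* FIN, PSH and URG only make sense together with ACK. */
    if (!(flags & SK_ACK) && (flags & (SK_FIN | SK_PSH | SK_URG))) {
        return true;
    }

    return false;
}
```

By this reading a bare SYN, SYN|ACK, FIN|ACK and a bare RST are accepted,
while SYN|FIN, SYN|RST, a bare FIN and a packet with no flags are rejected.
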
> >> +
> >> +#define TCP_MAX_WSCALE 14
> >> +#define CT_WSCALE_FLAG 0x80
> >> +#define CT_WSCALE_UNKNOWN 0x40
> >> +#define CT_WSCALE_MASK 0xf
> >> +
> >> +static uint8_t
> >> +tcp_get_wscale(const struct tcp_header *tcp)
> >> +{
> >> +    int len = TCP_OFFSET(tcp->tcp_ctl) * 4 - sizeof *tcp;
> >> +    const uint8_t *opt = (const uint8_t *)(tcp + 1);
> >> +    uint8_t wscale = 0;
> >> +    uint8_t optlen;
> >> +
> >> +    while (len >= 3) {
> >> +        switch (*opt) {
> >> +        case TCPOPT_EOL:
> >> +            return wscale;
> >> +        case TCPOPT_NOP:
> >> +            opt++;
> >> +            len--;
> >> +            break;
> >> +        case TCPOPT_WINDOW:
> >> +            wscale = MIN(opt[2], TCP_MAX_WSCALE);
> >> +            wscale |= CT_WSCALE_FLAG;
> >> +            /* fall through */
> >> +        default:
> >> +            optlen = opt[1];
> >> +            if (optlen < 2) {
> >> +                optlen = 2;
> >> +            }
> >> +            len -= optlen;
> >> +            opt += optlen;
> >> +        }
> >> +    }
> >> +
> >> +    return wscale;
> >> +}
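
Here is a standalone sketch of the option walk in tcp_get_wscale(), working
on a raw byte buffer so it can be tried outside OVS. The SK_* names and the
function are hypothetical. Note two deliberate differences from the patch:
the length is a size_t, and a truncated option ends the walk explicitly
instead of relying on a signed length going negative.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_TCPOPT_EOL     0
#define SK_TCPOPT_NOP     1
#define SK_TCPOPT_WINDOW  3
#define SK_MAX_WSCALE     14
#define SK_WSCALE_FLAG    0x80

/* Walks 'len' bytes of TCP options at 'opt'.  Returns the window scale
 * shift (capped at SK_MAX_WSCALE) ORed with SK_WSCALE_FLAG, or 0 if no
 * window scale option was found. */
static uint8_t
sk_get_wscale(const uint8_t *opt, size_t len)
{
    uint8_t wscale = 0;

    while (len >= 3) {
        uint8_t optlen;

        switch (opt[0]) {
        case SK_TCPOPT_EOL:
            return wscale;
        case SK_TCPOPT_NOP:
            /* Single-byte padding option. */
            opt++;
            len--;
            continue;
        case SK_TCPOPT_WINDOW:
            wscale = opt[2] < SK_MAX_WSCALE ? opt[2] : SK_MAX_WSCALE;
            wscale |= SK_WSCALE_FLAG;
            break;
        default:
            break;
        }

        /* Every other option carries its own length in the second byte. */
        optlen = opt[1] < 2 ? 2 : opt[1];
        if (optlen > len) {
            break;          /* Truncated option: stop walking. */
        }
        opt += optlen;
        len -= optlen;
    }

    return wscale;
}
```

For example, an options area of MSS (4 bytes), NOP, then window scale with
shift 7 yields 0x87, and a shift of 20 is capped to 14 and yields 0x8e.
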
> >> +
> >> +static uint32_t
> >> +tcp_payload_length(struct dp_packet *pkt)
> >> +{
> >> +    return (char *) dp_packet_tail(pkt) -
> dp_packet_l2_pad_size(pkt)
> >> +           - (char *) dp_packet_get_tcp_payload(pkt);
> >> +}
> >> +
> >> +static enum ct_update_res
> >> +tcp_conn_update(struct conn* conn_, struct dp_packet *pkt, bool
> >> reply,
> >> +                long long now)
> >> +{
> >> +    struct conn_tcp *conn = conn_tcp_cast(conn_);
> >> +    struct tcp_header *tcp = dp_packet_l4(pkt);
> >> +    /* The peer that sent 'pkt' */
> >> +    struct tcp_peer *src = &conn->peer[reply ? 1 : 0];
> >> +    /* The peer that should receive 'pkt' */
> >> +    struct tcp_peer *dst = &conn->peer[reply ? 0 : 1];
> >> +    uint8_t sws = 0, dws = 0;
> >> +    uint16_t tcp_flags = TCP_FLAGS(tcp->tcp_ctl);
> >> +
> >> +    uint16_t win = ntohs(tcp->tcp_winsz);
> >> +    uint32_t ack, end, seq, orig_seq;
> >> +    uint32_t p_len = tcp_payload_length(pkt);
> >> +    int ackskew;
> >> +
> >> +    if (tcp_invalid_flags(tcp_flags)) {
> >> +        return CT_UPDATE_INVALID;
> >> +    }
> >> +
> >> +    if (((tcp_flags & (TCP_SYN|TCP_ACK)) == TCP_SYN) &&
> >> +            dst->state >= CT_DPIF_TCPS_FIN_WAIT_2 &&
> >> +            src->state >= CT_DPIF_TCPS_FIN_WAIT_2) {
> >> +        src->state = dst->state = CT_DPIF_TCPS_CLOSED;
> >> +        return CT_UPDATE_NEW;
> >> +    }
> >> +
> >> +    if (src->wscale & CT_WSCALE_FLAG
> >> +        && dst->wscale & CT_WSCALE_FLAG
> >> +        && !(tcp_flags & TCP_SYN)) {
> >> +
> >> +        sws = src->wscale & CT_WSCALE_MASK;
> >> +        dws = dst->wscale & CT_WSCALE_MASK;
> >> +
> >> +    } else if (src->wscale & CT_WSCALE_UNKNOWN
> >> +        && dst->wscale & CT_WSCALE_UNKNOWN
> >> +        && !(tcp_flags & TCP_SYN)) {
> >> +
> >> +        sws = TCP_MAX_WSCALE;
> >> +        dws = TCP_MAX_WSCALE;
> >> +    }
> >> +
> >> +    /*
> >> +     * Sequence tracking algorithm from Guido van Rooij's paper:
> >> +     *   http://www.madison-gurkha.com/publications/tcp_filtering/
> >> +     *      tcp_filtering.ps
> >> +     */
> >> +
> >> +    orig_seq = seq = ntohl(get_16aligned_be32(&tcp->tcp_seq));
> >> +    if (src->state < CT_DPIF_TCPS_SYN_SENT) {
> >> +        /* First packet from this end. Set its state */
> >> +
> >> +        ack = ntohl(get_16aligned_be32(&tcp->tcp_ack));
> >> +
> >> +        end = seq + p_len;
> >> +        if (tcp_flags & TCP_SYN) {
> >> +            end++;
> >> +            if (dst->wscale & CT_WSCALE_FLAG) {
> >> +                src->wscale = tcp_get_wscale(tcp);
> >> +                if (src->wscale & CT_WSCALE_FLAG) {
> >> +                    /* Remove scale factor from initial window */
> >> +                    sws = src->wscale & CT_WSCALE_MASK;
> >> +                    win = DIV_ROUND_UP((uint32_t) win, 1 << sws);
> >> +                    dws = dst->wscale & CT_WSCALE_MASK;
> >> +                } else {
> >> +                    /* fixup other window */
> >> +                    dst->max_win <<= dst->wscale &
> >> +                        CT_WSCALE_MASK;
> >> +                    /* in case of a retrans SYN|ACK */
> >> +                    dst->wscale = 0;
> >> +                }
> >> +            }
> >> +        }
> >> +        if (tcp_flags & TCP_FIN) {
> >> +            end++;
> >> +        }
> >> +
> >> +        src->seqlo = seq;
> >> +        src->state = CT_DPIF_TCPS_SYN_SENT;
> >> +        /*
> >> +         * May need to slide the window (seqhi may have been set
> by
> >> +         * the crappy stack check or if we picked up the
> connection
> >> +         * after establishment)
> >> +         */
> >> +        if (src->seqhi == 1 ||
> >> +                SEQ_GEQ(end + MAX(1, dst->max_win << dws), src->seqhi)) {
> >> +            src->seqhi = end + MAX(1, dst->max_win << dws);
> >> +        }
> >> +        if (win > src->max_win) {
> >> +            src->max_win = win;
> >> +        }
> >> +
> >> +    } else {
> >> +        ack = ntohl(get_16aligned_be32(&tcp->tcp_ack));
> >> +        end = seq + p_len;
> >> +        if (tcp_flags & TCP_SYN) {
> >> +            end++;
> >> +        }
> >> +        if (tcp_flags & TCP_FIN) {
> >> +            end++;
> >> +        }
> >> +    }
> >> +
> >> +    if ((tcp_flags & TCP_ACK) == 0) {
> >> +        /* Let it pass through the ack skew check */
> >> +        ack = dst->seqlo;
> >> +    } else if ((ack == 0
> >> +                && (tcp_flags & (TCP_ACK|TCP_RST)) ==
> >> (TCP_ACK|TCP_RST))
> >> +               /* broken tcp stacks do not set ack */) {
> >> +        /* Many stacks (ours included) will set the ACK number in
> an
> >> +         * FIN|ACK if the SYN times out -- no sequence to ACK. */
> >> +        ack = dst->seqlo;
> >> +    }
> >> +
> >> +    if (seq == end) {
> >> +        /* Ease sequencing restrictions on no data packets */
> >> +        seq = src->seqlo;
> >> +        end = seq;
> >> +    }
> >> +
> >> +    ackskew = dst->seqlo - ack;
> >> +#define MAXACKWINDOW (0xffff + 1500)    /* 1500 is an arbitrary
> >> fudge factor */
> >> +    if (SEQ_GEQ(src->seqhi, end)
> >> +        /* Last octet inside other's window space */
> >> +        && SEQ_GEQ(seq, src->seqlo - (dst->max_win << dws))
> >> +        /* Retrans: not more than one window back */
> >> +        && (ackskew >= -MAXACKWINDOW)
> >> +        /* Acking not more than one reassembled fragment
> backwards
> >> */
> >> +        && (ackskew <= (MAXACKWINDOW << sws))
> >> +        /* Acking not more than one window forward */
> >> +        && ((tcp_flags & TCP_RST) == 0 || orig_seq == src->seqlo
> >> +            || (orig_seq == src->seqlo + 1) || (orig_seq + 1 == src->seqlo))) {
> >> +        /* Require an exact/+1 sequence match on resets when
> >> possible */
> >> +
> >> +        /* update max window */
> >> +        if (src->max_win < win) {
> >> +            src->max_win = win;
> >> +        }
> >> +        /* synchronize sequencing */
> >> +        if (SEQ_GT(end, src->seqlo)) {
> >> +            src->seqlo = end;
> >> +        }
> >> +        /* slide the window of what the other end can send */
> >> +        if (SEQ_GEQ(ack + (win << sws), dst->seqhi)) {
> >> +            dst->seqhi = ack + MAX((win << sws), 1);
> >> +        }
> >> +
> >> +        /* update states */
> >> +        if (tcp_flags & TCP_SYN && src->state <
> >> CT_DPIF_TCPS_SYN_SENT) {
> >> +                src->state = CT_DPIF_TCPS_SYN_SENT;
> >> +        }
> >> +        if (tcp_flags & TCP_FIN && src->state <
> >> CT_DPIF_TCPS_CLOSING) {
> >> +                src->state = CT_DPIF_TCPS_CLOSING;
> >> +        }
> >> +        if (tcp_flags & TCP_ACK) {
> >> +            if (dst->state == CT_DPIF_TCPS_SYN_SENT) {
> >> +                dst->state = CT_DPIF_TCPS_ESTABLISHED;
> >> +            } else if (dst->state == CT_DPIF_TCPS_CLOSING) {
> >> +                dst->state = CT_DPIF_TCPS_FIN_WAIT_2;
> >> +            }
> >> +        }
> >> +        if (tcp_flags & TCP_RST) {
> >> +            src->state = dst->state = CT_DPIF_TCPS_TIME_WAIT;
> >> +        }
> >> +
> >> +        if (src->state >= CT_DPIF_TCPS_FIN_WAIT_2
> >> +            && dst->state >= CT_DPIF_TCPS_FIN_WAIT_2) {
> >> +            update_expiration(conn_, CT_TM_TCP_CLOSED, now);
> >> +        } else if (src->state >= CT_DPIF_TCPS_CLOSING
> >> +                   && dst->state >= CT_DPIF_TCPS_CLOSING) {
> >> +            update_expiration(conn_, CT_TM_TCP_FIN_WAIT, now);
> >> +        } else if (src->state < CT_DPIF_TCPS_ESTABLISHED
> >> +                   || dst->state < CT_DPIF_TCPS_ESTABLISHED) {
> >> +            update_expiration(conn_, CT_TM_TCP_OPENING, now);
> >> +        } else if (src->state >= CT_DPIF_TCPS_CLOSING
> >> +                   || dst->state >= CT_DPIF_TCPS_CLOSING) {
> >> +            update_expiration(conn_, CT_TM_TCP_CLOSING, now);
> >> +        } else {
> >> +            update_expiration(conn_, CT_TM_TCP_ESTABLISHED, now);
> >> +        }
> >> +    } else if ((dst->state < CT_DPIF_TCPS_SYN_SENT
> >> +                || dst->state >= CT_DPIF_TCPS_FIN_WAIT_2
> >> +                || src->state >= CT_DPIF_TCPS_FIN_WAIT_2)
> >> +               && SEQ_GEQ(src->seqhi + MAXACKWINDOW, end)
> >> +               /* Within a window forward of the originating
> packet
> >> */
> >> +               && SEQ_GEQ(seq, src->seqlo - MAXACKWINDOW)) {
> >> +               /* Within a window backward of the originating
> packet
> >> */
> >> +
> >> +        /*
> >> +         * This currently handles three situations:
> >> +         *  1) Stupid stacks will shotgun SYNs before their peer
> >> +         *     replies.
> >> +         *  2) When PF catches an already established stream (the
> >> +         *     firewall rebooted, the state table was flushed,
> >> routes
> >> +         *     changed...)
> >> +         *  3) Packets get funky immediately after the connection
> >> +         *     closes (this should catch Solaris spurious
> ACK|FINs
> >> +         *     that web servers like to spew after a close)
> >> +         *
> >> +         * This must be a little more careful than the above code
> >> +         * since packet floods will also be caught here. We don't
> >> +         * update the TTL here to mitigate the damage of a packet
> >> +         * flood and so the same code can handle awkward
> >> establishment
> >> +         * and a loosened connection close.
> >> +         * In the establishment case, a correct peer response
> will
> >> +         * validate the connection, go through the normal state
> code
> >> +         * and keep updating the state TTL.
> >> +         */
> >> +
> >> +        /* update max window */
> >> +        if (src->max_win < win) {
> >> +            src->max_win = win;
> >> +        }
> >> +        /* synchronize sequencing */
> >> +        if (SEQ_GT(end, src->seqlo)) {
> >> +            src->seqlo = end;
> >> +        }
> >> +        /* slide the window of what the other end can send */
> >> +        if (SEQ_GEQ(ack + (win << sws), dst->seqhi)) {
> >> +            dst->seqhi = ack + MAX((win << sws), 1);
> >> +        }
> >> +
> >> +        /*
> >> +         * Cannot set dst->seqhi here since this could be a
> >> shotgunned
> >> +         * SYN and not an already established connection.
> >> +         */
> >> +
> >> +        if (tcp_flags & TCP_FIN && src->state <
> >> CT_DPIF_TCPS_CLOSING) {
> >> +            src->state = CT_DPIF_TCPS_CLOSING;
> >> +        }
> >> +
> >> +        if (tcp_flags & TCP_RST) {
> >> +            src->state = dst->state = CT_DPIF_TCPS_TIME_WAIT;
> >> +        }
> >> +    } else {
> >> +        return CT_UPDATE_INVALID;
> >> +    }
> >> +
> >> +    return CT_UPDATE_VALID;
> >> +}
> >> +
> >> +static bool
> >> +tcp_valid_new(struct dp_packet *pkt)
> >> +{
> >> +    struct tcp_header *tcp = dp_packet_l4(pkt);
> >> +    uint16_t tcp_flags = TCP_FLAGS(tcp->tcp_ctl);
> >> +
> >> +    if (tcp_invalid_flags(tcp_flags)) {
> >> +        return false;
> >> +    }
> >> +
> >> +    /* A syn+ack is not allowed to create a connection.  We want
> to
> >> allow
> >> +     * totally new connections (syn) or already established, not
> >> partially
> >> +     * open (syn+ack). */
> >> +    if ((tcp_flags & TCP_SYN) && (tcp_flags & TCP_ACK)) {
> >> +        return false;
> >> +    }
> >> +
> >> +    return true;
> >> +}
> >> +
> >> +static struct conn *
> >> +tcp_new_conn(struct dp_packet *pkt, long long now)
> >> +{
> >> +    struct conn_tcp* newconn = NULL;
> >> +    struct tcp_header *tcp = dp_packet_l4(pkt);
> >> +    struct tcp_peer *src, *dst;
> >> +    uint16_t tcp_flags = TCP_FLAGS(tcp->tcp_ctl);
> >> +
> >> +    newconn = xzalloc(sizeof *newconn);
> >> +
> >> +    src = &newconn->peer[0];
> >> +    dst = &newconn->peer[1];
> >> +
> >> +    src->seqlo = ntohl(get_16aligned_be32(&tcp->tcp_seq));
> >> +    src->seqhi = src->seqlo + tcp_payload_length(pkt) + 1;
> >> +
> >> +    if (tcp_flags & TCP_SYN) {
> >> +        src->seqhi++;
> >> +        src->wscale = tcp_get_wscale(tcp);
> >> +    } else {
> >> +        src->wscale = CT_WSCALE_UNKNOWN;
> >> +        dst->wscale = CT_WSCALE_UNKNOWN;
> >> +    }
> >> +    src->max_win = MAX(ntohs(tcp->tcp_winsz), 1);
> >> +    if (src->wscale & CT_WSCALE_MASK) {
> >> +        /* Remove scale factor from initial window */
> >> +        uint8_t sws = src->wscale & CT_WSCALE_MASK;
> >> +        src->max_win = DIV_ROUND_UP((uint32_t) src->max_win, 1 <<
> >> sws);
> >> +    }
> >> +    if (tcp_flags & TCP_FIN) {
> >> +        src->seqhi++;
> >> +    }
> >> +    dst->seqhi = 1;
> >> +    dst->max_win = 1;
> >> +    src->state = CT_DPIF_TCPS_SYN_SENT;
> >> +    dst->state = CT_DPIF_TCPS_CLOSED;
> >> +
> >> +    update_expiration(&newconn->up, CT_TM_TCP_FIRST_PACKET, now);
> >> +
> >> +    return &newconn->up;
> >> +}
> >> +
> >> +struct ct_l4_proto ct_proto_tcp = {
> >> +    .new_conn = tcp_new_conn,
> >> +    .valid_new = tcp_valid_new,
> >> +    .conn_update = tcp_conn_update,
> >> +};
> >> diff --git a/lib/conntrack.c b/lib/conntrack.c
> >> new file mode 100644
> >> index 0000000..e282485
> >> --- /dev/null
> >> +++ b/lib/conntrack.c
> >> @@ -0,0 +1,883 @@
> >> +/*
> >> + * Copyright (c) 2015, 2016 Nicira, Inc.
> >> + *
> >> + * Licensed under the Apache License, Version 2.0 (the
> "License");
> >> + * you may not use this file except in compliance with the
> License.
> >> + * You may obtain a copy of the License at:
> >> + *
> >> + *     http://www.apache.org/licenses/LICENSE-2.0
> >> + *
> >> + * Unless required by applicable law or agreed to in writing,
> >> software
> >> + * distributed under the License is distributed on an "AS IS"
> BASIS,
> >> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
> or
> >> implied.
> >> + * See the License for the specific language governing
> permissions
> >> and
> >> + * limitations under the License.
> >> + */
> >> +
> >> +#include <config.h>
> >> +#include "conntrack.h"
> >> +
> >> +#include <errno.h>
> >> +#include <sys/types.h>
> >> +#include <netinet/in.h>
> >> +#include <netinet/icmp6.h>
> >> +
> >> +#include "bitmap.h"
> >> +#include "conntrack-private.h"
> >> +#include "coverage.h"
> >> +#include "csum.h"
> >> +#include "dp-packet.h"
> >> +#include "flow.h"
> >> +#include "hmap.h"
> >> +#include "netdev.h"
> >> +#include "odp-netlink.h"
> >> +#include "openvswitch/vlog.h"
> >> +#include "ovs-rcu.h"
> >> +#include "random.h"
> >> +#include "timeval.h"
> >> +
> >> +VLOG_DEFINE_THIS_MODULE(conntrack);
> >> +
> >> +COVERAGE_DEFINE(conntrack_new_full);
> >> +
> >> +struct conn_lookup_ctx {
> >> +    struct conn_key key;
> >> +    struct conn *conn;
> >> +    uint32_t hash;
> >> +    bool reply;
> >> +    bool related;
> >> +};
> >> +
> >> +static bool conn_key_extract(struct conntrack *, struct dp_packet
> *,
> >> +                             struct conn_lookup_ctx *, uint16_t
> >> zone);
> >> +static uint32_t conn_key_hash(const struct conn_key *, uint32_t
> >> basis);
> >> +static void conn_key_reverse(struct conn_key *);
> >> +static void conn_key_lookup(struct conntrack_bucket *ctb,
> >> +                            struct conn_lookup_ctx *ctx,
> >> +                            long long now);
> >> +static bool valid_new(struct dp_packet *pkt, struct conn_key *);
> >> +static struct conn *new_conn(struct dp_packet *pkt, struct
> conn_key
> >> *,
> >> +                             long long now);
> >> +static void delete_conn(struct conn *);
> >> +static enum ct_update_res conn_update(struct conn *, struct
> >> dp_packet*,
> >> +                                      bool reply, long long now);
> >> +static bool conn_expired(struct conn *, long long now);
> >> +static void set_mark(struct dp_packet *, struct conn *,
> >> +                     uint32_t val, uint32_t mask);
> >> +static void set_label(struct dp_packet *, struct conn *,
> >> +                      const struct ovs_key_ct_labels *val,
> >> +                      const struct ovs_key_ct_labels *mask);
> >> +
> >> +static struct ct_l4_proto *l4_protos[] = {
> >> +    [IPPROTO_TCP] = &ct_proto_tcp,
> >> +    [IPPROTO_UDP] = &ct_proto_other,
> >> +    [IPPROTO_ICMP] = &ct_proto_other,
> >> +    [IPPROTO_ICMPV6] = &ct_proto_other,
> >> +};
> >> +
> >> +long long ct_timeout_val[] = {
> >> +#define CT_TIMEOUT(NAME, VAL) [CT_TM_##NAME] = VAL,
> >> +    CT_TIMEOUTS
> >> +#undef CT_TIMEOUT
> >> +};
> >> +
> >> +/* If the total number of connections goes above this value, no
> new
> >> connections
> >> + * are accepted */
> >> +#define DEFAULT_N_CONN_LIMIT 3000000
> >> +
> >> +/* Initializes the connection tracker 'ct'.  The caller is
> >> responsible for
> >> + * calling 'conntrack_destroy()', when the instance is not needed
> >> anymore */
> >> +void
> >> +conntrack_init(struct conntrack *ct)
> >> +{
> >> +    unsigned i;
> >> +
> >> +    for (i = 0; i < CONNTRACK_BUCKETS; i++) {
> >> +        struct conntrack_bucket *ctb = &ct->buckets[i];
> >> +
> >> +        ct_lock_init(&ctb->lock);
> >> +        ct_lock_lock(&ctb->lock);
> >> +        hmap_init(&ctb->connections);
> >> +        ct_lock_unlock(&ctb->lock);
> >> +    }
> >> +    ct->hash_basis = random_uint32();
> >> +    atomic_count_init(&ct->n_conn, 0);
> >> +    atomic_init(&ct->n_conn_limit, DEFAULT_N_CONN_LIMIT);
> >> +}
> >> +
> >> +/* Destroys the connection tracker 'ct' and frees all the
> allocated
> >> memory. */
> >> +void
> >> +conntrack_destroy(struct conntrack *ct)
> >> +{
> >> +    unsigned i;
> >> +
> >> +    for (i = 0; i < CONNTRACK_BUCKETS; i++) {
> >> +        struct conntrack_bucket *ctb = &ct->buckets[i];
> >> +        struct conn *conn;
> >> +
> >> +        ct_lock_lock(&ctb->lock);
> >> +        HMAP_FOR_EACH_POP(conn, node, &ctb->connections) {
> >> +            atomic_count_dec(&ct->n_conn);
> >> +            delete_conn(conn);
> >> +        }
> >> +        hmap_destroy(&ctb->connections);
> >> +        ct_lock_unlock(&ctb->lock);
> >> +        ct_lock_destroy(&ctb->lock);
> >> +    }
> >> +}
> >> +
> >
> >> +static unsigned hash_to_bucket(uint32_t hash)
> >> +{
> >> +    /* Extracts the most significant bits in hash. The least
> >> significant bits
> >> +     * are already used internally by the hmap implementation. */
> >> +    BUILD_ASSERT(CONNTRACK_BUCKETS_SHIFT < 32 &&
> >> CONNTRACK_BUCKETS_SHIFT >= 1);
> >> +
> >> +    return (hash >> (32 - CONNTRACK_BUCKETS_SHIFT)) %
> >> CONNTRACK_BUCKETS;
> >> +}
> >> +
> >> +static void
> >> +write_ct_md(struct dp_packet *pkt, uint16_t state, uint16_t zone,
> >> +            uint32_t mark, ovs_u128 label)
> >> +{
> >> +    pkt->md.ct_state = state | CS_TRACKED;
> >> +    pkt->md.ct_zone = zone;
> >> +    pkt->md.ct_mark = mark;
> >> +    pkt->md.ct_label = label;
> >> +}
> >> +
> >> +static struct conn *
> >> +conn_not_found(struct conntrack *ct, struct dp_packet *pkt,
> >> +               struct conn_lookup_ctx *ctx, uint16_t *state, bool
> >> commit,
> >> +               long long now)
> >> +{
> >> +    unsigned bucket = hash_to_bucket(ctx->hash);
> >> +    struct conn *nc = NULL;
> >> +
> >> +    if (!valid_new(pkt, &ctx->key)) {
> >> +        *state |= CS_INVALID;
> >> +        return nc;
> >> +    }
> >> +
> >> +    *state |= CS_NEW;
> >> +
> >> +    if (commit) {
> >> +        unsigned int n_conn_limit;
> >> +
> >> +        atomic_read_relaxed(&ct->n_conn_limit, &n_conn_limit);
> >> +
> >> +        if (atomic_count_get(&ct->n_conn) >= n_conn_limit) {
> >> +            COVERAGE_INC(conntrack_new_full);
> >> +            return nc;
> >> +        }
> >> +
> >> +        nc = new_conn(pkt, &ctx->key, now);
> >> +
> >> +        memcpy(&nc->rev_key, &ctx->key, sizeof nc->rev_key);
> >> +
> >> +        conn_key_reverse(&nc->rev_key);
> >> +        hmap_insert(&ct->buckets[bucket].connections, &nc->node,
> >> ctx->hash);
> >> +        atomic_count_inc(&ct->n_conn);
> >> +    }
> >> +
> >> +    return nc;
> >> +}
> >> +
> >> +static struct conn *
> >> +process_one(struct conntrack *ct, struct dp_packet *pkt,
> >> +            struct conn_lookup_ctx *ctx, uint16_t zone,
> >> +            bool commit, long long now)
> >> +{
> >> +    unsigned bucket = hash_to_bucket(ctx->hash);
> >> +    struct conn *conn = ctx->conn;
> >> +    uint16_t state = 0;
> >> +
> >> +    if (conn) {
> >> +        if (ctx->related) {
> >> +            state |= CS_RELATED;
> >> +            if (ctx->reply) {
> >> +                state |= CS_REPLY_DIR;
> >> +            }
> >> +        } else {
> >> +            enum ct_update_res res;
> >> +
> >> +            res = conn_update(conn, pkt, ctx->reply, now);
> >> +
> >> +            switch (res) {
> >> +            case CT_UPDATE_VALID:
> >> +                state |= CS_ESTABLISHED;
> >> +                if (ctx->reply) {
> >> +                    state |= CS_REPLY_DIR;
> >> +                }
> >> +                break;
> >> +            case CT_UPDATE_INVALID:
> >> +                state |= CS_INVALID;
> >> +                break;
> >> +            case CT_UPDATE_NEW:
> >> +                hmap_remove(&ct->buckets[bucket].connections,
> >> +                            &conn->node);
> >> +                atomic_count_dec(&ct->n_conn);
> >> +                delete_conn(conn);
> >> +                conn = conn_not_found(ct, pkt, ctx, &state,
> commit,
> >> now);
> >> +                break;
> >> +            }
> >
> > [Antonio F] Sorry to repeat, but I'd prefer to add the 'default'
> > case, here. I mean something like
> >
> > default:
> >     state |= CS_INVALID;
> >     break;
> >
> > I know if we add new items to enum ct_update_res we can get a
> > warning from the compiler, but I wouldn't rely on that.
> 
> If we're going to rely upon a default case that we don't expect to
> hit, we should consider calling OVS_NOT_REACHED() rather than
> treating the
> traffic as invalid; this would be easier to track down than quietly
> marking some traffic as invalid.

[Antonio F] Both options look fine to me. I just wanted to point out
that it would be better to add a default case to cover any possible value.
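
To make the two options concrete, here is a minimal standalone sketch,
not the patch code. The enum and the CS_* flag values are hypothetical
stand-ins rather than the real OVS definitions, abort() stands in for
OVS_NOT_REACHED(), and the CT_UPDATE_NEW branch is reduced to setting
CS_NEW instead of deleting the connection and calling conn_not_found().

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the patch's definitions.  The numeric
 * values are illustrative only, not copied from the OVS headers. */
enum ct_update_res {
    CT_UPDATE_INVALID,
    CT_UPDATE_VALID,
    CT_UPDATE_NEW,
};

#define CS_NEW         0x01
#define CS_ESTABLISHED 0x02
#define CS_INVALID     0x10

/* Option 1: an unexpected result is quietly treated as invalid traffic. */
uint16_t
state_default_invalid(enum ct_update_res res)
{
    uint16_t state = 0;

    switch (res) {
    case CT_UPDATE_VALID:
        state |= CS_ESTABLISHED;
        break;
    case CT_UPDATE_INVALID:
        state |= CS_INVALID;
        break;
    case CT_UPDATE_NEW:
        state |= CS_NEW;        /* Simplified; see the note above. */
        break;
    default:
        state |= CS_INVALID;
        break;
    }
    return state;
}

/* Option 2: an unexpected result stops the process, so the bug shows up
 * immediately.  abort() is only a stand-in for OVS_NOT_REACHED(). */
uint16_t
state_default_abort(enum ct_update_res res)
{
    uint16_t state = 0;

    switch (res) {
    case CT_UPDATE_VALID:
        state |= CS_ESTABLISHED;
        break;
    case CT_UPDATE_INVALID:
        state |= CS_INVALID;
        break;
    case CT_UPDATE_NEW:
        state |= CS_NEW;        /* Simplified; see the note above. */
        break;
    default:
        fprintf(stderr, "unexpected ct_update_res %d\n", (int) res);
        abort();
    }
    return state;
}
```

With option 1 an out-of-range value ends up as CS_INVALID and the packet
is handled like any other invalid one. With option 2 the same value
terminates the process at the point of the bug, which is the trade-off
Joe describes.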
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev
