On 24 May 2016 at 07:19, Fischetti, Antonio <antonio.fische...@intel.com> wrote:
> Hi Daniele, just a comment below.
> Apart from that, it looks good to me, thanks.
>
> Acked-by: Antonio Fischetti <antonio.fische...@intel.com>
>
>> -----Original Message-----
>> From: dev [mailto:dev-boun...@openvswitch.org] On Behalf Of Daniele
>> Di Proietto
>> Sent: Tuesday, May 17, 2016 1:56 AM
>> To: dev@openvswitch.org
>> Subject: [ovs-dev] [PATCH v3 04/16] conntrack: New userspace
>> connection tracker.
>>
>> This commit adds the conntrack module.
>>
>> It is a connection tracker that resides entirely in userspace. Its
>> primary user will be the dpif-netdev datapath.
>>
>> The module main goal is to provide conntrack_execute(), which offers a
>> convenient interface to implement the datapath ct() action.
>>
>> The conntrack module uses two submodules to deal with the l4 protocol
>> details (conntrack-other for UDP and ICMP, conntrack-tcp for TCP).
>>
>> The conntrack-tcp submodule implementation is adapted from FreeBSD's pf
>> subsystem, therefore it's BSD licensed. It has been slightly altered to
>> match the OVS coding style and to allow the pickup of already
>> established connections.
>> >> Signed-off-by: Daniele Di Proietto <diproiet...@vmware.com> >> --- >> COPYING | 1 + >> debian/copyright.in | 4 + >> include/openvswitch/types.h | 4 + >> lib/automake.mk | 5 + >> lib/conntrack-other.c | 85 +++++ >> lib/conntrack-private.h | 88 +++++ >> lib/conntrack-tcp.c | 463 +++++++++++++++++++++++ >> lib/conntrack.c | 883 >> ++++++++++++++++++++++++++++++++++++++++++++ >> lib/conntrack.h | 151 ++++++++ >> lib/util.h | 9 + >> 10 files changed, 1693 insertions(+) >> create mode 100644 lib/conntrack-other.c >> create mode 100644 lib/conntrack-private.h >> create mode 100644 lib/conntrack-tcp.c >> create mode 100644 lib/conntrack.c >> create mode 100644 lib/conntrack.h >> >> diff --git a/COPYING b/COPYING >> index 308e3ea..afb98b9 100644 >> --- a/COPYING >> +++ b/COPYING >> @@ -25,6 +25,7 @@ License, version 2. >> The following files are licensed under the 2-clause BSD license. >> include/windows/getopt.h >> lib/getopt_long.c >> + lib/conntrack-tcp.c >> >> The following files are licensed under the 3-clause BSD-license >> include/windows/netinet/icmp6.h >> diff --git a/debian/copyright.in b/debian/copyright.in >> index 57d007a..a15f4dd 100644 >> --- a/debian/copyright.in >> +++ b/debian/copyright.in >> @@ -21,6 +21,9 @@ Upstream Copyright Holders: >> Copyright (c) 2014 Michael Chapman >> Copyright (c) 2014 WindRiver, Inc. >> Copyright (c) 2014 Avaya, Inc. 
>> + Copyright (c) 2001 Daniel Hartmeier >> + Copyright (c) 2002 - 2008 Henning Brauer >> + Copyright (c) 2012 Gleb Smirnoff <gleb...@freebsd.org> >> >> License: >> >> @@ -90,6 +93,7 @@ License: >> lib/getopt_long.c >> include/windows/getopt.h >> datapath-windows/ovsext/Conntrack-tcp.c >> + lib/conntrack-tcp.c >> >> * The following files are licensed under the 3-clause BSD-license >> >> diff --git a/include/openvswitch/types.h >> b/include/openvswitch/types.h >> index 5f3347d..d7e94a6 100644 >> --- a/include/openvswitch/types.h >> +++ b/include/openvswitch/types.h >> @@ -107,6 +107,10 @@ static const ovs_u128 OVS_U128_MAX = { { >> UINT32_MAX, UINT32_MAX, >> UINT32_MAX, UINT32_MAX } }; >> static const ovs_be128 OVS_BE128_MAX OVS_UNUSED = { { OVS_BE32_MAX, >> OVS_BE32_MAX, >> OVS_BE32_MAX, >> OVS_BE32_MAX } }; >> +static const ovs_u128 OVS_U128_MIN OVS_UNUSED = { {0, 0, 0, 0} }; >> +static const ovs_u128 OVS_BE128_MIN OVS_UNUSED = { {0, 0, 0, 0} }; >> + >> +#define OVS_U128_ZERO OVS_U128_MIN >> >> /* A 64-bit value, in network byte order, that is only aligned on a >> 32-bit >> * boundary. */ >> diff --git a/lib/automake.mk b/lib/automake.mk >> index affbb5c..df8b07d 100644 >> --- a/lib/automake.mk >> +++ b/lib/automake.mk >> @@ -47,6 +47,11 @@ lib_libopenvswitch_la_SOURCES = \ >> lib/compiler.h \ >> lib/connectivity.c \ >> lib/connectivity.h \ >> + lib/conntrack-private.h \ >> + lib/conntrack-tcp.c \ >> + lib/conntrack-other.c \ >> + lib/conntrack.c \ >> + lib/conntrack.h \ >> lib/coverage.c \ >> lib/coverage.h \ >> lib/crc32c.c \ >> diff --git a/lib/conntrack-other.c b/lib/conntrack-other.c >> new file mode 100644 >> index 0000000..295cb2c >> --- /dev/null >> +++ b/lib/conntrack-other.c >> @@ -0,0 +1,85 @@ >> +/* >> + * Copyright (c) 2015, 2016 Nicira, Inc. >> + * >> + * Licensed under the Apache License, Version 2.0 (the "License"); >> + * you may not use this file except in compliance with the License. 
>> + * You may obtain a copy of the License at: >> + * >> + * http://www.apache.org/licenses/LICENSE-2.0 >> + * >> + * Unless required by applicable law or agreed to in writing, >> software >> + * distributed under the License is distributed on an "AS IS" BASIS, >> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or >> implied. >> + * See the License for the specific language governing permissions >> and >> + * limitations under the License. >> + */ >> + >> +#include <config.h> >> + >> +#include "conntrack-private.h" >> +#include "dp-packet.h" >> + >> +enum other_state { >> + OTHERS_FIRST, >> + OTHERS_MULTIPLE, >> + OTHERS_BIDIR, >> +}; >> + >> +struct conn_other { >> + struct conn up; >> + enum other_state state; >> +}; >> + >> +static const enum ct_timeout other_timeouts[] = { >> + [OTHERS_FIRST] = CT_TM_OTHER_FIRST, >> + [OTHERS_MULTIPLE] = CT_TM_OTHER_MULTIPLE, >> + [OTHERS_BIDIR] = CT_TM_OTHER_BIDIR, >> +}; >> + >> +static struct conn_other * >> +conn_other_cast(const struct conn *conn) >> +{ >> + return CONTAINER_OF(conn, struct conn_other, up); >> +} >> + >> +static enum ct_update_res >> +other_conn_update(struct conn *conn_, struct dp_packet *pkt >> OVS_UNUSED, >> + bool reply, long long now) >> +{ >> + struct conn_other *conn = conn_other_cast(conn_); >> + >> + if (reply && conn->state != OTHERS_BIDIR) { >> + conn->state = OTHERS_BIDIR; >> + } else if (conn->state == OTHERS_FIRST) { >> + conn->state = OTHERS_MULTIPLE; >> + } >> + >> + update_expiration(conn_, other_timeouts[conn->state], now); >> + >> + return CT_UPDATE_VALID; >> +} >> + >> +static bool >> +other_valid_new(struct dp_packet *pkt OVS_UNUSED) >> +{ >> + return true; >> +} >> + >> +static struct conn * >> +other_new_conn(struct dp_packet *pkt OVS_UNUSED, long long now) >> +{ >> + struct conn_other *conn; >> + >> + conn = xzalloc(sizeof *conn); >> + conn->state = OTHERS_FIRST; >> + >> + update_expiration(&conn->up, other_timeouts[conn->state], now); >> + >> + return &conn->up; 
>> +} >> + >> +struct ct_l4_proto ct_proto_other = { >> + .new_conn = other_new_conn, >> + .valid_new = other_valid_new, >> + .conn_update = other_conn_update, >> +}; >> diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h >> new file mode 100644 >> index 0000000..d3e0099 >> --- /dev/null >> +++ b/lib/conntrack-private.h >> @@ -0,0 +1,88 @@ >> +/* >> + * Copyright (c) 2015, 2016 Nicira, Inc. >> + * >> + * Licensed under the Apache License, Version 2.0 (the "License"); >> + * you may not use this file except in compliance with the License. >> + * You may obtain a copy of the License at: >> + * >> + * http://www.apache.org/licenses/LICENSE-2.0 >> + * >> + * Unless required by applicable law or agreed to in writing, >> software >> + * distributed under the License is distributed on an "AS IS" BASIS, >> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or >> implied. >> + * See the License for the specific language governing permissions >> and >> + * limitations under the License. 
>> + */ >> + >> +#ifndef CONNTRACK_PRIVATE_H >> +#define CONNTRACK_PRIVATE_H 1 >> + >> +#include <sys/types.h> >> +#include <netinet/in.h> >> +#include <netinet/ip6.h> >> + >> +#include "conntrack.h" >> +#include "hmap.h" >> +#include "openvswitch/list.h" >> +#include "openvswitch/types.h" >> +#include "packets.h" >> +#include "unaligned.h" >> + >> +struct ct_addr { >> + union { >> + ovs_16aligned_be32 ipv4; >> + union ovs_16aligned_in6_addr ipv6; >> + ovs_be32 ipv4_aligned; >> + struct in6_addr ipv6_aligned; >> + }; >> +}; >> + >> +struct ct_endpoint { >> + struct ct_addr addr; >> + ovs_be16 port; >> +}; >> + >> +struct conn_key { >> + struct ct_endpoint src; >> + struct ct_endpoint dst; >> + >> + ovs_be16 dl_type; >> + uint8_t nw_proto; >> + uint16_t zone; >> +}; >> + >> +struct conn { >> + struct conn_key key; >> + struct conn_key rev_key; >> + long long expiration; >> + struct ovs_list exp_node; >> + struct hmap_node node; >> + uint32_t mark; >> + ovs_u128 label; >> +}; >> + >> +enum ct_update_res { >> + CT_UPDATE_INVALID, >> + CT_UPDATE_VALID, >> + CT_UPDATE_NEW, >> +}; >> + >> +struct ct_l4_proto { >> + struct conn *(*new_conn)(struct dp_packet *pkt, long long now); >> + bool (*valid_new)(struct dp_packet *pkt); >> + enum ct_update_res (*conn_update)(struct conn *conn, struct >> dp_packet *pkt, >> + bool reply, long long now); >> +}; >> + >> +extern struct ct_l4_proto ct_proto_tcp; >> +extern struct ct_l4_proto ct_proto_other; >> + >> +extern long long ct_timeout_val[]; >> + >> +static inline void >> +update_expiration(struct conn *conn, enum ct_timeout tm, long long >> now) >> +{ >> + conn->expiration = now + ct_timeout_val[tm]; >> +} >> + >> +#endif /* conntrack-private.h */ >> diff --git a/lib/conntrack-tcp.c b/lib/conntrack-tcp.c >> new file mode 100644 >> index 0000000..4d3d9c3 >> --- /dev/null >> +++ b/lib/conntrack-tcp.c >> @@ -0,0 +1,463 @@ >> +/*- >> + * Copyright (c) 2001 Daniel Hartmeier >> + * Copyright (c) 2002 - 2008 Henning Brauer >> + * 
Copyright (c) 2012 Gleb Smirnoff <gleb...@freebsd.org> >> + * Copyright (c) 2015, 2016 Nicira, Inc. >> + * All rights reserved. >> + * >> + * Redistribution and use in source and binary forms, with or >> without >> + * modification, are permitted provided that the following >> conditions >> + * are met: >> + * >> + * - Redistributions of source code must retain the above >> copyright >> + * notice, this list of conditions and the following >> disclaimer. >> + * - Redistributions in binary form must reproduce the above >> + * copyright notice, this list of conditions and the following >> + * disclaimer in the documentation and/or other materials >> provided >> + * with the distribution. >> + * >> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND >> CONTRIBUTORS >> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT >> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS >> + * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE >> + * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, >> INDIRECT, >> + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES >> (INCLUDING, >> + * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; >> + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER >> + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, >> STRICT >> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN >> + * ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE >> + * POSSIBILITY OF SUCH DAMAGE. >> + * >> + * Effort sponsored in part by the Defense Advanced Research >> Projects >> + * Agency (DARPA) and Air Force Research Laboratory, Air Force >> + * Materiel Command, USAF, under agreement number F30602-01-2-0537. 
>> + * >> + * $OpenBSD: pf.c,v 1.634 2009/02/27 12:37:45 henning Exp $ >> + */ >> + >> +#include <config.h> >> + >> +#include "conntrack-private.h" >> +#include "ct-dpif.h" >> +#include "dp-packet.h" >> +#include "util.h" >> + >> +struct tcp_peer { >> + enum ct_dpif_tcp_state state; >> + uint32_t seqlo; /* Max sequence number >> sent */ >> + uint32_t seqhi; /* Max the other end ACKd >> + win */ >> + uint16_t max_win; /* largest window (pre >> scaling) */ >> + uint8_t wscale; /* window scaling factor >> */ >> +}; >> + >> +struct conn_tcp { >> + struct conn up; >> + struct tcp_peer peer[2]; >> +}; >> + >> +enum { >> + TCPOPT_EOL, >> + TCPOPT_NOP, >> + TCPOPT_WINDOW = 3, >> +}; >> + >> +/* TCP sequence numbers are 32 bit integers operated >> + * on with modular arithmetic. These macros can be >> + * used to compare such integers. */ >> +#define SEQ_LT(a,b) INT_MOD_LT(a, b) >> +#define SEQ_LEQ(a,b) INT_MOD_LEQ(a, b) >> +#define SEQ_GT(a,b) INT_MOD_GT(a, b) >> +#define SEQ_GEQ(a,b) INT_MOD_GEQ(a, b) >> + >> +#define SEQ_MIN(a, b) INT_MOD_MIN(a, b) >> +#define SEQ_MAX(a, b) INT_MOD_MAX(a, b) >> + >> +static struct conn_tcp* >> +conn_tcp_cast(const struct conn* conn) >> +{ >> + return CONTAINER_OF(conn, struct conn_tcp, up); >> +} >> + >> +/* pf does this in in pf_normalize_tcp(), and it is called only if >> scrub >> + * is enabled. We're not scrubbing, but this check seems >> reasonable. 
*/ >> +static bool >> +tcp_invalid_flags(uint16_t flags) >> +{ >> + >> + if (flags & TCP_SYN) { >> + if (flags & TCP_RST || flags & TCP_FIN) { >> + return true; >> + } >> + } else { >> + /* Illegal packet */ >> + if (!(flags & (TCP_ACK|TCP_RST))) { >> + return true; >> + } >> + } >> + >> + if (!(flags & TCP_ACK)) { >> + /* These flags are only valid if ACK is set */ >> + if ((flags & TCP_FIN) || (flags & TCP_PSH) || (flags & >> TCP_URG)) { >> + return true; >> + } >> + } >> + >> + return false; >> +} >> + >> +#define TCP_MAX_WSCALE 14 >> +#define CT_WSCALE_FLAG 0x80 >> +#define CT_WSCALE_UNKNOWN 0x40 >> +#define CT_WSCALE_MASK 0xf >> + >> +static uint8_t >> +tcp_get_wscale(const struct tcp_header *tcp) >> +{ >> + int len = TCP_OFFSET(tcp->tcp_ctl) * 4 - sizeof *tcp; >> + const uint8_t *opt = (const uint8_t *)(tcp + 1); >> + uint8_t wscale = 0; >> + uint8_t optlen; >> + >> + while (len >= 3) { >> + switch (*opt) { >> + case TCPOPT_EOL: >> + return wscale; >> + case TCPOPT_NOP: >> + opt++; >> + len--; >> + break; >> + case TCPOPT_WINDOW: >> + wscale = MIN(opt[2], TCP_MAX_WSCALE); >> + wscale |= CT_WSCALE_FLAG; >> + /* fall through */ >> + default: >> + optlen = opt[1]; >> + if (optlen < 2) { >> + optlen = 2; >> + } >> + len -= optlen; >> + opt += optlen; >> + } >> + } >> + >> + return wscale; >> +} >> + >> +static uint32_t >> +tcp_payload_length(struct dp_packet *pkt) >> +{ >> + return (char *) dp_packet_tail(pkt) - dp_packet_l2_pad_size(pkt) >> + - (char *) dp_packet_get_tcp_payload(pkt); >> +} >> + >> +static enum ct_update_res >> +tcp_conn_update(struct conn* conn_, struct dp_packet *pkt, bool >> reply, >> + long long now) >> +{ >> + struct conn_tcp *conn = conn_tcp_cast(conn_); >> + struct tcp_header *tcp = dp_packet_l4(pkt); >> + /* The peer that sent 'pkt' */ >> + struct tcp_peer *src = &conn->peer[reply ? 1 : 0]; >> + /* The peer that should receive 'pkt' */ >> + struct tcp_peer *dst = &conn->peer[reply ? 
0 : 1]; >> + uint8_t sws = 0, dws = 0; >> + uint16_t tcp_flags = TCP_FLAGS(tcp->tcp_ctl); >> + >> + uint16_t win = ntohs(tcp->tcp_winsz); >> + uint32_t ack, end, seq, orig_seq; >> + uint32_t p_len = tcp_payload_length(pkt); >> + int ackskew; >> + >> + if (tcp_invalid_flags(tcp_flags)) { >> + return CT_UPDATE_INVALID; >> + } >> + >> + if (((tcp_flags & (TCP_SYN|TCP_ACK)) == TCP_SYN) && >> + dst->state >= CT_DPIF_TCPS_FIN_WAIT_2 && >> + src->state >= CT_DPIF_TCPS_FIN_WAIT_2) { >> + src->state = dst->state = CT_DPIF_TCPS_CLOSED; >> + return CT_UPDATE_NEW; >> + } >> + >> + if (src->wscale & CT_WSCALE_FLAG >> + && dst->wscale & CT_WSCALE_FLAG >> + && !(tcp_flags & TCP_SYN)) { >> + >> + sws = src->wscale & CT_WSCALE_MASK; >> + dws = dst->wscale & CT_WSCALE_MASK; >> + >> + } else if (src->wscale & CT_WSCALE_UNKNOWN >> + && dst->wscale & CT_WSCALE_UNKNOWN >> + && !(tcp_flags & TCP_SYN)) { >> + >> + sws = TCP_MAX_WSCALE; >> + dws = TCP_MAX_WSCALE; >> + } >> + >> + /* >> + * Sequence tracking algorithm from Guido van Rooij's paper: >> + * http://www.madison-gurkha.com/publications/tcp_filtering/ >> + * tcp_filtering.ps >> + */ >> + >> + orig_seq = seq = ntohl(get_16aligned_be32(&tcp->tcp_seq)); >> + if (src->state < CT_DPIF_TCPS_SYN_SENT) { >> + /* First packet from this end. 
Set its state */ >> + >> + ack = ntohl(get_16aligned_be32(&tcp->tcp_ack)); >> + >> + end = seq + p_len; >> + if (tcp_flags & TCP_SYN) { >> + end++; >> + if (dst->wscale & CT_WSCALE_FLAG) { >> + src->wscale = tcp_get_wscale(tcp); >> + if (src->wscale & CT_WSCALE_FLAG) { >> + /* Remove scale factor from initial window */ >> + sws = src->wscale & CT_WSCALE_MASK; >> + win = DIV_ROUND_UP((uint32_t) win, 1 << sws); >> + dws = dst->wscale & CT_WSCALE_MASK; >> + } else { >> + /* fixup other window */ >> + dst->max_win <<= dst->wscale & >> + CT_WSCALE_MASK; >> + /* in case of a retrans SYN|ACK */ >> + dst->wscale = 0; >> + } >> + } >> + } >> + if (tcp_flags & TCP_FIN) { >> + end++; >> + } >> + >> + src->seqlo = seq; >> + src->state = CT_DPIF_TCPS_SYN_SENT; >> + /* >> + * May need to slide the window (seqhi may have been set by >> + * the crappy stack check or if we picked up the connection >> + * after establishment) >> + */ >> + if (src->seqhi == 1 || >> + SEQ_GEQ(end + MAX(1, dst->max_win << dws), src- >> >seqhi)) { >> + src->seqhi = end + MAX(1, dst->max_win << dws); >> + } >> + if (win > src->max_win) { >> + src->max_win = win; >> + } >> + >> + } else { >> + ack = ntohl(get_16aligned_be32(&tcp->tcp_ack)); >> + end = seq + p_len; >> + if (tcp_flags & TCP_SYN) { >> + end++; >> + } >> + if (tcp_flags & TCP_FIN) { >> + end++; >> + } >> + } >> + >> + if ((tcp_flags & TCP_ACK) == 0) { >> + /* Let it pass through the ack skew check */ >> + ack = dst->seqlo; >> + } else if ((ack == 0 >> + && (tcp_flags & (TCP_ACK|TCP_RST)) == >> (TCP_ACK|TCP_RST)) >> + /* broken tcp stacks do not set ack */) { >> + /* Many stacks (ours included) will set the ACK number in an >> + * FIN|ACK if the SYN times out -- no sequence to ACK. 
*/ >> + ack = dst->seqlo; >> + } >> + >> + if (seq == end) { >> + /* Ease sequencing restrictions on no data packets */ >> + seq = src->seqlo; >> + end = seq; >> + } >> + >> + ackskew = dst->seqlo - ack; >> +#define MAXACKWINDOW (0xffff + 1500) /* 1500 is an arbitrary >> fudge factor */ >> + if (SEQ_GEQ(src->seqhi, end) >> + /* Last octet inside other's window space */ >> + && SEQ_GEQ(seq, src->seqlo - (dst->max_win << dws)) >> + /* Retrans: not more than one window back */ >> + && (ackskew >= -MAXACKWINDOW) >> + /* Acking not more than one reassembled fragment backwards >> */ >> + && (ackskew <= (MAXACKWINDOW << sws)) >> + /* Acking not more than one window forward */ >> + && ((tcp_flags & TCP_RST) == 0 || orig_seq == src->seqlo >> + || (orig_seq == src->seqlo + 1) || (orig_seq + 1 == src- >> >seqlo))) { >> + /* Require an exact/+1 sequence match on resets when >> possible */ >> + >> + /* update max window */ >> + if (src->max_win < win) { >> + src->max_win = win; >> + } >> + /* synchronize sequencing */ >> + if (SEQ_GT(end, src->seqlo)) { >> + src->seqlo = end; >> + } >> + /* slide the window of what the other end can send */ >> + if (SEQ_GEQ(ack + (win << sws), dst->seqhi)) { >> + dst->seqhi = ack + MAX((win << sws), 1); >> + } >> + >> + /* update states */ >> + if (tcp_flags & TCP_SYN && src->state < >> CT_DPIF_TCPS_SYN_SENT) { >> + src->state = CT_DPIF_TCPS_SYN_SENT; >> + } >> + if (tcp_flags & TCP_FIN && src->state < >> CT_DPIF_TCPS_CLOSING) { >> + src->state = CT_DPIF_TCPS_CLOSING; >> + } >> + if (tcp_flags & TCP_ACK) { >> + if (dst->state == CT_DPIF_TCPS_SYN_SENT) { >> + dst->state = CT_DPIF_TCPS_ESTABLISHED; >> + } else if (dst->state == CT_DPIF_TCPS_CLOSING) { >> + dst->state = CT_DPIF_TCPS_FIN_WAIT_2; >> + } >> + } >> + if (tcp_flags & TCP_RST) { >> + src->state = dst->state = CT_DPIF_TCPS_TIME_WAIT; >> + } >> + >> + if (src->state >= CT_DPIF_TCPS_FIN_WAIT_2 >> + && dst->state >= CT_DPIF_TCPS_FIN_WAIT_2) { >> + update_expiration(conn_, CT_TM_TCP_CLOSED, 
now); >> + } else if (src->state >= CT_DPIF_TCPS_CLOSING >> + && dst->state >= CT_DPIF_TCPS_CLOSING) { >> + update_expiration(conn_, CT_TM_TCP_FIN_WAIT, now); >> + } else if (src->state < CT_DPIF_TCPS_ESTABLISHED >> + || dst->state < CT_DPIF_TCPS_ESTABLISHED) { >> + update_expiration(conn_, now, CT_TM_TCP_OPENING); >> + } else if (src->state >= CT_DPIF_TCPS_CLOSING >> + || dst->state >= CT_DPIF_TCPS_CLOSING) { >> + update_expiration(conn_, now, CT_TM_TCP_CLOSING); >> + } else { >> + update_expiration(conn_, now, CT_TM_TCP_ESTABLISHED); >> + } >> + } else if ((dst->state < CT_DPIF_TCPS_SYN_SENT >> + || dst->state >= CT_DPIF_TCPS_FIN_WAIT_2 >> + || src->state >= CT_DPIF_TCPS_FIN_WAIT_2) >> + && SEQ_GEQ(src->seqhi + MAXACKWINDOW, end) >> + /* Within a window forward of the originating packet >> */ >> + && SEQ_GEQ(seq, src->seqlo - MAXACKWINDOW)) { >> + /* Within a window backward of the originating packet >> */ >> + >> + /* >> + * This currently handles three situations: >> + * 1) Stupid stacks will shotgun SYNs before their peer >> + * replies. >> + * 2) When PF catches an already established stream (the >> + * firewall rebooted, the state table was flushed, >> routes >> + * changed...) >> + * 3) Packets get funky immediately after the connection >> + * closes (this should catch Solaris spurious ACK|FINs >> + * that web servers like to spew after a close) >> + * >> + * This must be a little more careful than the above code >> + * since packet floods will also be caught here. We don't >> + * update the TTL here to mitigate the damage of a packet >> + * flood and so the same code can handle awkward >> establishment >> + * and a loosened connection close. >> + * In the establishment case, a correct peer response will >> + * validate the connection, go through the normal state code >> + * and keep updating the state TTL. 
>> + */ >> + >> + /* update max window */ >> + if (src->max_win < win) { >> + src->max_win = win; >> + } >> + /* synchronize sequencing */ >> + if (SEQ_GT(end, src->seqlo)) { >> + src->seqlo = end; >> + } >> + /* slide the window of what the other end can send */ >> + if (SEQ_GEQ(ack + (win << sws), dst->seqhi)) { >> + dst->seqhi = ack + MAX((win << sws), 1); >> + } >> + >> + /* >> + * Cannot set dst->seqhi here since this could be a >> shotgunned >> + * SYN and not an already established connection. >> + */ >> + >> + if (tcp_flags & TCP_FIN && src->state < >> CT_DPIF_TCPS_CLOSING) { >> + src->state = CT_DPIF_TCPS_CLOSING; >> + } >> + >> + if (tcp_flags & TCP_RST) { >> + src->state = dst->state = CT_DPIF_TCPS_TIME_WAIT; >> + } >> + } else { >> + return CT_UPDATE_INVALID; >> + } >> + >> + return CT_UPDATE_VALID; >> +} >> + >> +static bool >> +tcp_valid_new(struct dp_packet *pkt) >> +{ >> + struct tcp_header *tcp = dp_packet_l4(pkt); >> + uint16_t tcp_flags = TCP_FLAGS(tcp->tcp_ctl); >> + >> + if (tcp_invalid_flags(tcp_flags)) { >> + return false; >> + } >> + >> + /* A syn+ack is not allowed to create a connection. We want to >> allow >> + * totally new connections (syn) or already established, not >> partially >> + * open (syn+ack). 
*/ >> + if ((tcp_flags & TCP_SYN) && (tcp_flags & TCP_ACK)) { >> + return false; >> + } >> + >> + return true; >> +} >> + >> +static struct conn * >> +tcp_new_conn(struct dp_packet *pkt, long long now) >> +{ >> + struct conn_tcp* newconn = NULL; >> + struct tcp_header *tcp = dp_packet_l4(pkt); >> + struct tcp_peer *src, *dst; >> + uint16_t tcp_flags = TCP_FLAGS(tcp->tcp_ctl); >> + >> + newconn = xzalloc(sizeof *newconn); >> + >> + src = &newconn->peer[0]; >> + dst = &newconn->peer[1]; >> + >> + src->seqlo = ntohl(get_16aligned_be32(&tcp->tcp_seq)); >> + src->seqhi = src->seqlo + tcp_payload_length(pkt) + 1; >> + >> + if (tcp_flags & TCP_SYN) { >> + src->seqhi++; >> + src->wscale = tcp_get_wscale(tcp); >> + } else { >> + src->wscale = CT_WSCALE_UNKNOWN; >> + dst->wscale = CT_WSCALE_UNKNOWN; >> + } >> + src->max_win = MAX(ntohs(tcp->tcp_winsz), 1); >> + if (src->wscale & CT_WSCALE_MASK) { >> + /* Remove scale factor from initial window */ >> + uint8_t sws = src->wscale & CT_WSCALE_MASK; >> + src->max_win = DIV_ROUND_UP((uint32_t) src->max_win, 1 << >> sws); >> + } >> + if (tcp_flags & TCP_FIN) { >> + src->seqhi++; >> + } >> + dst->seqhi = 1; >> + dst->max_win = 1; >> + src->state = CT_DPIF_TCPS_SYN_SENT; >> + dst->state = CT_DPIF_TCPS_CLOSED; >> + >> + update_expiration(&newconn->up, now, CT_TM_TCP_FIRST_PACKET); >> + >> + return &newconn->up; >> +} >> + >> +struct ct_l4_proto ct_proto_tcp = { >> + .new_conn = tcp_new_conn, >> + .valid_new = tcp_valid_new, >> + .conn_update = tcp_conn_update, >> +}; >> diff --git a/lib/conntrack.c b/lib/conntrack.c >> new file mode 100644 >> index 0000000..e282485 >> --- /dev/null >> +++ b/lib/conntrack.c >> @@ -0,0 +1,883 @@ >> +/* >> + * Copyright (c) 2015, 2016 Nicira, Inc. >> + * >> + * Licensed under the Apache License, Version 2.0 (the "License"); >> + * you may not use this file except in compliance with the License. 
>> + * You may obtain a copy of the License at: >> + * >> + * http://www.apache.org/licenses/LICENSE-2.0 >> + * >> + * Unless required by applicable law or agreed to in writing, >> software >> + * distributed under the License is distributed on an "AS IS" BASIS, >> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or >> implied. >> + * See the License for the specific language governing permissions >> and >> + * limitations under the License. >> + */ >> + >> +#include <config.h> >> +#include "conntrack.h" >> + >> +#include <errno.h> >> +#include <sys/types.h> >> +#include <netinet/in.h> >> +#include <netinet/icmp6.h> >> + >> +#include "bitmap.h" >> +#include "conntrack-private.h" >> +#include "coverage.h" >> +#include "csum.h" >> +#include "dp-packet.h" >> +#include "flow.h" >> +#include "hmap.h" >> +#include "netdev.h" >> +#include "odp-netlink.h" >> +#include "openvswitch/vlog.h" >> +#include "ovs-rcu.h" >> +#include "random.h" >> +#include "timeval.h" >> + >> +VLOG_DEFINE_THIS_MODULE(conntrack); >> + >> +COVERAGE_DEFINE(conntrack_new_full); >> + >> +struct conn_lookup_ctx { >> + struct conn_key key; >> + struct conn *conn; >> + uint32_t hash; >> + bool reply; >> + bool related; >> +}; >> + >> +static bool conn_key_extract(struct conntrack *, struct dp_packet *, >> + struct conn_lookup_ctx *, uint16_t >> zone); >> +static uint32_t conn_key_hash(const struct conn_key *, uint32_t >> basis); >> +static void conn_key_reverse(struct conn_key *); >> +static void conn_key_lookup(struct conntrack_bucket *ctb, >> + struct conn_lookup_ctx *ctx, >> + long long now); >> +static bool valid_new(struct dp_packet *pkt, struct conn_key *); >> +static struct conn *new_conn(struct dp_packet *pkt, struct conn_key >> *, >> + long long now); >> +static void delete_conn(struct conn *); >> +static enum ct_update_res conn_update(struct conn *, struct >> dp_packet*, >> + bool reply, long long now); >> +static bool conn_expired(struct conn *, long long now); >> +static void 
set_mark(struct dp_packet *, struct conn *, >> + uint32_t val, uint32_t mask); >> +static void set_label(struct dp_packet *, struct conn *, >> + const struct ovs_key_ct_labels *val, >> + const struct ovs_key_ct_labels *mask); >> + >> +static struct ct_l4_proto *l4_protos[] = { >> + [IPPROTO_TCP] = &ct_proto_tcp, >> + [IPPROTO_UDP] = &ct_proto_other, >> + [IPPROTO_ICMP] = &ct_proto_other, >> + [IPPROTO_ICMPV6] = &ct_proto_other, >> +}; >> + >> +long long ct_timeout_val[] = { >> +#define CT_TIMEOUT(NAME, VAL) [CT_TM_##NAME] = VAL, >> + CT_TIMEOUTS >> +#undef CT_TIMEOUT >> +}; >> + >> +/* If the total number of connections goes above this value, no new >> connections >> + * are accepted */ >> +#define DEFAULT_N_CONN_LIMIT 3000000 >> + >> +/* Initializes the connection tracker 'ct'. The caller is >> responbile for >> + * calling 'conntrack_destroy()', when the instance is not needed >> anymore */ >> +void >> +conntrack_init(struct conntrack *ct) >> +{ >> + unsigned i; >> + >> + for (i = 0; i < CONNTRACK_BUCKETS; i++) { >> + struct conntrack_bucket *ctb = &ct->buckets[i]; >> + >> + ct_lock_init(&ctb->lock); >> + ct_lock_lock(&ctb->lock); >> + hmap_init(&ctb->connections); >> + ct_lock_unlock(&ctb->lock); >> + } >> + ct->hash_basis = random_uint32(); >> + atomic_count_init(&ct->n_conn, 0); >> + atomic_init(&ct->n_conn_limit, DEFAULT_N_CONN_LIMIT); >> +} >> + >> +/* Destroys the connection tracker 'ct' and frees all the allocated >> memory. 
*/ >> +void >> +conntrack_destroy(struct conntrack *ct) >> +{ >> + unsigned i; >> + >> + for (i = 0; i < CONNTRACK_BUCKETS; i++) { >> + struct conntrack_bucket *ctb = &ct->buckets[i]; >> + struct conn *conn; >> + >> + ct_lock_lock(&ctb->lock); >> + HMAP_FOR_EACH_POP(conn, node, &ctb->connections) { >> + atomic_count_dec(&ct->n_conn); >> + delete_conn(conn); >> + } >> + hmap_destroy(&ctb->connections); >> + ct_lock_unlock(&ctb->lock); >> + ct_lock_destroy(&ctb->lock); >> + } >> +} >> + > >> +static unsigned hash_to_bucket(uint32_t hash) >> +{ >> + /* Extracts the most significant bits in hash. The least >> significant bits >> + * are already used internally by the hmap implementation. */ >> + BUILD_ASSERT(CONNTRACK_BUCKETS_SHIFT < 32 && >> CONNTRACK_BUCKETS_SHIFT >= 1); >> + >> + return (hash >> (32 - CONNTRACK_BUCKETS_SHIFT)) % >> CONNTRACK_BUCKETS; >> +} >> + >> +static void >> +write_ct_md(struct dp_packet *pkt, uint16_t state, uint16_t zone, >> + uint32_t mark, ovs_u128 label) >> +{ >> + pkt->md.ct_state = state | CS_TRACKED; >> + pkt->md.ct_zone = zone; >> + pkt->md.ct_mark = mark; >> + pkt->md.ct_label = label; >> +} >> + >> +static struct conn * >> +conn_not_found(struct conntrack *ct, struct dp_packet *pkt, >> + struct conn_lookup_ctx *ctx, uint16_t *state, bool >> commit, >> + long long now) >> +{ >> + unsigned bucket = hash_to_bucket(ctx->hash); >> + struct conn *nc = NULL; >> + >> + if (!valid_new(pkt, &ctx->key)) { >> + *state |= CS_INVALID; >> + return nc; >> + } >> + >> + *state |= CS_NEW; >> + >> + if (commit) { >> + unsigned int n_conn_limit; >> + >> + atomic_read_relaxed(&ct->n_conn_limit, &n_conn_limit); >> + >> + if (atomic_count_get(&ct->n_conn) >= n_conn_limit) { >> + COVERAGE_INC(conntrack_new_full); >> + return nc; >> + } >> + >> + nc = new_conn(pkt, &ctx->key, now); >> + >> + memcpy(&nc->rev_key, &ctx->key, sizeof nc->rev_key); >> + >> + conn_key_reverse(&nc->rev_key); >> + hmap_insert(&ct->buckets[bucket].connections, &nc->node, >> 
ctx->hash);
>> +        atomic_count_inc(&ct->n_conn);
>> +    }
>> +
>> +    return nc;
>> +}
>> +
>> +static struct conn *
>> +process_one(struct conntrack *ct, struct dp_packet *pkt,
>> +            struct conn_lookup_ctx *ctx, uint16_t zone,
>> +            bool commit, long long now)
>> +{
>> +    unsigned bucket = hash_to_bucket(ctx->hash);
>> +    struct conn *conn = ctx->conn;
>> +    uint16_t state = 0;
>> +
>> +    if (conn) {
>> +        if (ctx->related) {
>> +            state |= CS_RELATED;
>> +            if (ctx->reply) {
>> +                state |= CS_REPLY_DIR;
>> +            }
>> +        } else {
>> +            enum ct_update_res res;
>> +
>> +            res = conn_update(conn, pkt, ctx->reply, now);
>> +
>> +            switch (res) {
>> +            case CT_UPDATE_VALID:
>> +                state |= CS_ESTABLISHED;
>> +                if (ctx->reply) {
>> +                    state |= CS_REPLY_DIR;
>> +                }
>> +                break;
>> +            case CT_UPDATE_INVALID:
>> +                state |= CS_INVALID;
>> +                break;
>> +            case CT_UPDATE_NEW:
>> +                hmap_remove(&ct->buckets[bucket].connections, &conn->node);
>> +                atomic_count_dec(&ct->n_conn);
>> +                delete_conn(conn);
>> +                conn = conn_not_found(ct, pkt, ctx, &state, commit, now);
>> +                break;
>> +            }
>
> [Antonio F] Sorry to repeat, but I'd prefer to add the 'default'
> case, here. I mean something like
>
>     default:
>         state |= CS_INVALID;
>         break;
>
> I know if we add new items to enum ct_update_res we can get a
> warning from the compiler, but I wouldn't rely on that.

If we're going to rely upon a default case that we don't expect to hit, we
should consider calling OVS_NOT_REACHED() rather than treating the traffic
as invalid; this would be easier to track down than quietly marking some
traffic as invalid.
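
To make the suggestion concrete, here is a minimal standalone sketch of a
switch whose default case aborts instead of setting CS_INVALID. It is not code
from the patch: NOT_REACHED() is a hypothetical stand-in for OVS_NOT_REACHED()
(assumed here to simply abort), the CS_* values are placeholders rather than
the real ct_state bits, and state_for_update() is an invented helper that
mirrors only the shape of the switch in process_one().

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for OVS_NOT_REACHED(); assumed to abort. */
#define NOT_REACHED() \
    do { \
        fprintf(stderr, "%s:%d: unreachable code\n", __FILE__, __LINE__); \
        abort(); \
    } while (0)

/* Placeholder flag values, for illustration only. */
enum {
    CS_ESTABLISHED = 1 << 0,
    CS_INVALID = 1 << 1,
    CS_REPLY_DIR = 1 << 2,
};

enum ct_update_res {
    CT_UPDATE_INVALID,
    CT_UPDATE_VALID,
    CT_UPDATE_NEW,
};

/* Invented helper: maps an update result to ct_state bits.  An unexpected
 * value aborts loudly instead of being reported as invalid traffic. */
uint16_t
state_for_update(enum ct_update_res res, int reply)
{
    uint16_t state = 0;

    switch (res) {
    case CT_UPDATE_VALID:
        state |= CS_ESTABLISHED;
        if (reply) {
            state |= CS_REPLY_DIR;
        }
        break;
    case CT_UPDATE_INVALID:
        state |= CS_INVALID;
        break;
    case CT_UPDATE_NEW:
        /* The patch deletes the old entry and retries as a new connection
         * here; that part is omitted from this sketch. */
        break;
    default:
        NOT_REACHED();
    }

    return state;
}
```

One trade-off to keep in mind: with GCC, adding a default label suppresses the
-Wswitch warning for an enumerator that has no case (-Wswitch-enum still
reports it), so an abort in the default case swaps a compile-time hint for a
run-time failure that is hard to miss.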