On 15 April 2016 at 17:02, Daniele Di Proietto <diproiet...@vmware.com> wrote: > This commit adds the conntrack module. > > It is a connection tracker that resides entirely in userspace. Its > primary user will be the dpif-netdev datapath. > > The module main goal is to provide conntrack_execute(), which offers a > convenient interface to implement the datapath ct() action. > > The conntrack module uses two submodules to deal with the l4 protocol > details (conntrack-other for UDP and ICMP, conntrack-tcp for TCP). > > The conntrack-tcp submodule implementation is adapted from FreeBSD's pf > subsystem, therefore it's BSD licensed. It has been slightly altered to > match the OVS coding style and to allow the pickup of already > established connections. > > Signed-off-by: Daniele Di Proietto <diproiet...@vmware.com>
Thanks for submitting this, I know there's a few people excited about this feature. I notice that there is no NEWS item about this, were you intending to hold off on announcing it until there is feature parity with the kernel, eg support for IPv[46] fragments; ALGs; NAT? > diff --git a/lib/conntrack-other.c b/lib/conntrack-other.c > new file mode 100644 > index 0000000..65d02a9 > --- /dev/null > +++ b/lib/conntrack-other.c > @@ -0,0 +1,91 @@ > +/* > + * Copyright (c) 2015, 2016 Nicira, Inc. > + * > + * Licensed under the Apache License, Version 2.0 (the "License"); > + * you may not use this file except in compliance with the License. > + * You may obtain a copy of the License at: > + * > + * http://www.apache.org/licenses/LICENSE-2.0 > + * > + * Unless required by applicable law or agreed to in writing, software > + * distributed under the License is distributed on an "AS IS" BASIS, > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > + * See the License for the specific language governing permissions and > + * limitations under the License. > + */ > + > +#include <config.h> > + > +#include "conntrack-private.h" > +#include "dp-packet.h" > + > +enum other_state { > + OTHERS_FIRST, > + OTHERS_MULTIPLE, > + OTHERS_BIDIR, > +}; > + > +struct conn_other { > + struct conn up; > + enum other_state state; > +}; > + > +static const long long other_timeouts[] = { > + [OTHERS_FIRST] = 60 * 1000, > + [OTHERS_MULTIPLE] = 60 * 1000, > + [OTHERS_BIDIR] = 30 * 1000, > +}; Are these timeouts in milliseconds? (a comment or #define might help with that) <snip> > diff --git a/lib/conntrack.c b/lib/conntrack.c > new file mode 100644 > index 0000000..840335b > --- /dev/null > +++ b/lib/conntrack.c ... > +static struct ct_l4_proto *l4_protos[] = { > + [IPPROTO_TCP] = &ct_proto_tcp, > + [IPPROTO_UDP] = &ct_proto_other, > + [IPPROTO_ICMP] = &ct_proto_other, > +}; I see that conntrack-other is shared by UDP and ICMP. In the Linux datapath, the UDP handler also checks UDP length and UDP checksum - I wonder if we can share most of the code here for these protocols but add these additional checks for the UDP case? > +static struct conn * > +conn_not_found(struct conntrack *ct, struct dp_packet *pkt, > + struct conn_lookup_ctx *ctx, uint8_t *state, bool commit, > + long long now) > +{ > + unsigned bucket = hash_to_bucket(ctx->hash); > + struct conn *nc = NULL; > + > + if (!valid_new(pkt, &ctx->key)) { > + *state |= CS_INVALID; > + return nc; > + } > + > + *state |= CS_NEW; > + > + if (commit) { > + nc = new_conn(pkt, &ctx->key, now); > + > + memcpy(&nc->rev_key, &ctx->key, sizeof nc->rev_key); > + > + conn_key_reverse(&nc->rev_key); > + hmap_insert(&ct->connections[bucket], &nc->node, ctx->hash); > + } > + > + return nc; > +} This function can return NULL.... > +static struct conn * > +process_one(struct conntrack *ct, struct dp_packet *pkt, > + struct conn_lookup_ctx *ctx, uint16_t zone, > + bool commit, long long now) > +{ > + unsigned bucket = hash_to_bucket(ctx->hash); > + struct conn *conn = ctx->conn; > + uint8_t state = 0; > + > + if (conn) { > + if (ctx->related) { > + state |= CS_RELATED; > + if (ctx->reply) { > + state |= CS_REPLY_DIR; > + } > + } else { > + enum ct_update_res res; > + > + res = conn_update(conn, pkt, ctx->reply, now); > + > + switch (res) { > + case CT_UPDATE_VALID: > + state |= CS_ESTABLISHED; > + if (ctx->reply) { > + state |= CS_REPLY_DIR; > + } > + break; > + case CT_UPDATE_INVALID: > + state |= CS_INVALID; > + break; > + case CT_UPDATE_NEW: > + hmap_remove(&ct->connections[bucket], &conn->node); > + delete_conn(conn); > + conn = conn_not_found(ct, pkt, ctx, &state, commit, now); ...which is then stored here... > + break; > + } > + } > + > + pkt->md.ct_label = conn->label; > + pkt->md.ct_mark = conn->mark; > + write_ct_md(pkt, state, zone, conn->mark, conn->label); ...and dereferenced here. Is it possible to send invalid packets for an existing connection and cause a NULL pointer dereference? > + } else { > + conn = conn_not_found(ct, pkt, ctx, &state, commit, now); > + write_ct_md(pkt, state, zone, 0, (ovs_u128) {{0}}); Maybe we can define OVS_U128_MIN (ovs_u128) {{0}} in a header somewhere? > +/* Cleans up old connection entries. Should be called periodically. */ > +void > +conntrack_run(struct conntrack *ct) > +{ > + unsigned bucket = hash_to_bucket(ct->purge_bucket); > + uint32_t inner_bucket = ct->purge_inner_bucket, > + inner_offset = ct->purge_inner_offset; Style seems a little unusual for the above 'inner_offset'. > +static inline bool > +extract_l4_udp(struct conn_key *key, const void *data, size_t size) > +{ > + const struct udp_header *udp = data; > + > + if (OVS_UNLIKELY(size < UDP_HEADER_LEN)) { > + return false; > + } Maybe my comment earlier about UDP len and checksum could be addressed here? ... > +static bool > +conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, > + struct conn_lookup_ctx *ctx, uint16_t zone) > +{ > + const struct eth_header *l2 = dp_packet_l2(pkt); > + const struct ip_header *l3 = dp_packet_l3(pkt); > + const char *l4 = dp_packet_l4(pkt); > + const char *tail = dp_packet_tail(pkt); > + bool ok; > + > + memset(ctx, 0, sizeof *ctx); > + > + if (!l2 || !l3 || !l4) { > + return false; > + } > + > + ctx->key.zone = zone; > + > + /* XXX In this function we parse the packet (again, it has already > + * gone through miniflow_extract()) for two reasons: > + * > + * 1) To extract the l3 addresses and l4 ports. > + * We already have the l3 and l4 headers' pointers. Extracting > + * the l3 addresses and the l4 ports is really cheap, since they > + * can be found at fixed locations. > + * 2) To extract the l3 and l4 types. > + * Extracting the l3 and l4 types (especially the l3[1]) on the > + * other hand is quite expensive, because they're not at a > + * fixed location. > + * > + * Here's a way to avoid (2) with the help of the datapath. > + * The datapath doesn't keep the packet's extracted flow[2], so > + * using that is not an option. We could use the packet's matching > + * megaflow, but we have to make sure that the l3 and l4 types > + * are unwildcarded. This means either: > + * > + * a) dpif-netdev unwildcards the l3 (and l4) types when a new flow > + * is installed if the actions contains ct(). This is what the > + * kernel datapath does. It is not so straightforward, though. > + * > + * b) ofproto-dpif-xlate unwildcards the l3 (and l4) types when > + * translating a ct() action. This is already done in different > + * actions and since both the userspace and the kernel datapath > + * would benefit from it, it seems an appropriate place to do > + * it. > + * > + * --- > + * [1] A simple benchmark (running only the connection tracker > + * over and over on the same packets) shows that if the > + * l3 type is already provided we are 15% faster (running the > + * connection tracker over a couple of DPDK devices with a > + * stream of UDP 64-bytes packets shows that we are 4% faster). > + * > + * [2] The reasons for this are that keeping the flow increases > + * (slightly) the cache footprint and increases computation > + * time as we move the packet around. Most importantly, the flow > + * should be updated by the actions and this can be slow, as > + * we use a sparse representation (miniflow). > + * > + */ I discussed the above offline with Jarno; we will always unwildcard the l3 -- see xlate_wc_init(). Do we have a benchmark on how costly the options are for figuring out the l4 type? > +static void > +conn_key_lookup(struct conntrack *ct, > + struct conn_lookup_ctx *ctx, > + unsigned bucket, > + long long now) > +{ > + struct conn *conn, *found = NULL; > + uint32_t hash = ctx->hash; > + bool reply; > + > + HMAP_FOR_EACH_WITH_HASH (conn, node, hash, &ct->connections[bucket]) { > + if (!memcmp(&conn->key, &ctx->key, sizeof(conn->key))) { > + found = conn; > + reply = false; > + break; > + } > + if (!memcmp(&conn->rev_key, &ctx->key, sizeof(conn->rev_key))) { > + found = conn; > + reply = true; > + break; > + } > + } > + > + if (found) { > + if (conn_expired(found, now)) { > + found = NULL; > + } else { > + ctx->reply = reply; > + } > + } Should this check be within the HMAP_FOR_EACH_WITH_HASH? Separate question, and maybe something to be followed up on separately from this series, today with the current userspace + kernel implementation we can install a flow like: arp,actions=ct(table=1) At OpenFlow layer we'll accept this (although maybe we shouldn't)... When processing an upcall, we'll translate this (though we have enough information to at least complain if not more) Then when installing/executing into the kernel datapath, we'll reject the command while parsing the action. I'm considering that maybe we should either a) Restrict this all the way up at OpenFlow, or b) loosen the restriction in kernel datapath and allow such flows (but mark the traffic as invalid early in conntrack processing and skip to the next action). So, to my question: How does the userspace handle a case like this? _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev