Thanks for the detailed review Joe, replies inline
On 19/04/2016 14:36, "Joe Stringer" <j...@ovn.org> wrote: >On 15 April 2016 at 17:02, Daniele Di Proietto <diproiet...@vmware.com> wrote: >> This commit adds the conntrack module. >> >> It is a connection tracker that resides entirely in userspace. Its >> primary user will be the dpif-netdev datapath. >> >> The module main goal is to provide conntrack_execute(), which offers a >> convenient interface to implement the datapath ct() action. >> >> The conntrack module uses two submodules to deal with the l4 protocol >> details (conntrack-other for UDP and ICMP, conntrack-tcp for TCP). >> >> The conntrack-tcp submodule implementation is adapted from FreeBSD's pf >> subsystem, therefore it's BSD licensed. It has been slightly altered to >> match the OVS coding style and to allow the pickup of already >> established connections. >> >> Signed-off-by: Daniele Di Proietto <diproiet...@vmware.com> > >Thanks for submitting this, I know there's a few people excited about >this feature. > >I notice that there is no NEWS item about this, were you intending to >hold off on announcing it until there is feature parity with the >kernel, eg support for IPv[46] fragments; ALGs; NAT? Actually I didn't think about this :-) I'll add a line in NEWS > > >> diff --git a/lib/conntrack-other.c b/lib/conntrack-other.c >> new file mode 100644 >> index 0000000..65d02a9 >> --- /dev/null >> +++ b/lib/conntrack-other.c >> @@ -0,0 +1,91 @@ >> +/* >> + * Copyright (c) 2015, 2016 Nicira, Inc. >> + * >> + * Licensed under the Apache License, Version 2.0 (the "License"); >> + * you may not use this file except in compliance with the License. >> + * You may obtain a copy of the License at: >> + * >> + * http://www.apache.org/licenses/LICENSE-2.0 >> + * >> + * Unless required by applicable law or agreed to in writing, software >> + * distributed under the License is distributed on an "AS IS" BASIS, >> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >> + * See the License for the specific language governing permissions and >> + * limitations under the License. >> + */ >> + >> +#include <config.h> >> + >> +#include "conntrack-private.h" >> +#include "dp-packet.h" >> + >> +enum other_state { >> + OTHERS_FIRST, >> + OTHERS_MULTIPLE, >> + OTHERS_BIDIR, >> +}; >> + >> +struct conn_other { >> + struct conn up; >> + enum other_state state; >> +}; >> + >> +static const long long other_timeouts[] = { >> + [OTHERS_FIRST] = 60 * 1000, >> + [OTHERS_MULTIPLE] = 60 * 1000, >> + [OTHERS_BIDIR] = 30 * 1000, >> +}; > >Are these timeouts in milliseconds? (a comment or #define might help with that) Sure, I'll add a comment. > ><snip> > >> diff --git a/lib/conntrack.c b/lib/conntrack.c >> new file mode 100644 >> index 0000000..840335b >> --- /dev/null >> +++ b/lib/conntrack.c >... >> +static struct ct_l4_proto *l4_protos[] = { >> + [IPPROTO_TCP] = &ct_proto_tcp, >> + [IPPROTO_UDP] = &ct_proto_other, >> + [IPPROTO_ICMP] = &ct_proto_other, >> +}; > >I see that conntrack-other is shared by UDP and ICMP. In the Linux >datapath, the UDP handler also checks UDP length and UDP checksum - I >wonder if we can share most of the code here for these protocols but >add these additional checks for the UDP case? Thanks for pointing this out! I think it's a good idea to check the UDP length, I've added a check later. This didn't handle at all checksum verification. I thought it was out of scope for the connection tracker, but since the kernel connection tracker does this, I've added the validation. None of this need (yet) a separate conntrack-icmp module, I'm implementing it when the key is being extracted (as you point out later). We would need a separate conntrack-icmp module if we wanted to track icmp type and code. > >> +static struct conn * >> +conn_not_found(struct conntrack *ct, struct dp_packet *pkt, >> + struct conn_lookup_ctx *ctx, uint8_t *state, bool commit, >> + long long now) >> +{ >> + unsigned bucket = hash_to_bucket(ctx->hash); >> + struct conn *nc = NULL; >> + >> + if (!valid_new(pkt, &ctx->key)) { >> + *state |= CS_INVALID; >> + return nc; >> + } >> + >> + *state |= CS_NEW; >> + >> + if (commit) { >> + nc = new_conn(pkt, &ctx->key, now); >> + >> + memcpy(&nc->rev_key, &ctx->key, sizeof nc->rev_key); >> + >> + conn_key_reverse(&nc->rev_key); >> + hmap_insert(&ct->connections[bucket], &nc->node, ctx->hash); >> + } >> + >> + return nc; >> +} > >This function can return NULL.... > >> +static struct conn * >> +process_one(struct conntrack *ct, struct dp_packet *pkt, >> + struct conn_lookup_ctx *ctx, uint16_t zone, >> + bool commit, long long now) >> +{ >> + unsigned bucket = hash_to_bucket(ctx->hash); >> + struct conn *conn = ctx->conn; >> + uint8_t state = 0; >> + >> + if (conn) { >> + if (ctx->related) { >> + state |= CS_RELATED; >> + if (ctx->reply) { >> + state |= CS_REPLY_DIR; >> + } >> + } else { >> + enum ct_update_res res; >> + >> + res = conn_update(conn, pkt, ctx->reply, now); >> + >> + switch (res) { >> + case CT_UPDATE_VALID: >> + state |= CS_ESTABLISHED; >> + if (ctx->reply) { >> + state |= CS_REPLY_DIR; >> + } >> + break; >> + case CT_UPDATE_INVALID: >> + state |= CS_INVALID; >> + break; >> + case CT_UPDATE_NEW: >> + hmap_remove(&ct->connections[bucket], &conn->node); >> + delete_conn(conn); >> + conn = conn_not_found(ct, pkt, ctx, &state, commit, now); > >...which is then stored here... > >> + break; >> + } >> + } >> + >> + pkt->md.ct_label = conn->label; >> + pkt->md.ct_mark = conn->mark; >> + write_ct_md(pkt, state, zone, conn->mark, conn->label); > >...and dereferenced here. Is it possible to send invalid packets for >an existing connection and cause a NULL pointer dereference? Yes, thanks for spotting this. I've added a NULL pointer check and moved this out of the if branch (it's needed also in the else branch) > > >> + } else { >> + conn = conn_not_found(ct, pkt, ctx, &state, commit, now); >> + write_ct_md(pkt, state, zone, 0, (ovs_u128) {{0}}); > >Maybe we can define OVS_U128_MIN (ovs_u128) {{0}} in a header somewhere? Ok. I've put OVS_U128_MIN in util.h (like OVS_U128_MAX) and defined an OVS_U128_ZERO macro as an alias. > > >> +/* Cleans up old connection entries. Should be called periodically. */ >> +void >> +conntrack_run(struct conntrack *ct) >> +{ >> + unsigned bucket = hash_to_bucket(ct->purge_bucket); >> + uint32_t inner_bucket = ct->purge_inner_bucket, >> + inner_offset = ct->purge_inner_offset; > >Style seems a little unusual for the above 'inner_offset'. Right, I put the variable in two separate statements. > > >> +static inline bool >> +extract_l4_udp(struct conn_key *key, const void *data, size_t size) >> +{ >> + const struct udp_header *udp = data; >> + >> + if (OVS_UNLIKELY(size < UDP_HEADER_LEN)) { >> + return false; >> + } > >Maybe my comment earlier about UDP len and checksum could be addressed here? Yes, thank you! > >... > > >> +static bool >> +conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, >> + struct conn_lookup_ctx *ctx, uint16_t zone) >> +{ >> + const struct eth_header *l2 = dp_packet_l2(pkt); >> + const struct ip_header *l3 = dp_packet_l3(pkt); >> + const char *l4 = dp_packet_l4(pkt); >> + const char *tail = dp_packet_tail(pkt); >> + bool ok; >> + >> + memset(ctx, 0, sizeof *ctx); >> + >> + if (!l2 || !l3 || !l4) { >> + return false; >> + } >> + >> + ctx->key.zone = zone; >> + >> + /* XXX In this function we parse the packet (again, it has already >> + * gone through miniflow_extract()) for two reasons: >> + * >> + * 1) To extract the l3 addresses and l4 ports. >> + * We already have the l3 and l4 headers' pointers. Extracting >> + * the l3 addresses and the l4 ports is really cheap, since they >> + * can be found at fixed locations. >> + * 2) To extract the l3 and l4 types. >> + * Extracting the l3 and l4 types (especially the l3[1]) on the >> + * other hand is quite expensive, because they're not at a >> + * fixed location. >> + * >> + * Here's a way to avoid (2) with the help of the datapath. >> + * The datapath doesn't keep the packet's extracted flow[2], so >> + * using that is not an option. We could use the packet's matching >> + * megaflow, but we have to make sure that the l3 and l4 types >> + * are unwildcarded. This means either: >> + * >> + * a) dpif-netdev unwildcards the l3 (and l4) types when a new flow >> + * is installed if the actions contains ct(). This is what the >> + * kernel datapath does. It is not so straightforward, though. >> + * >> + * b) ofproto-dpif-xlate unwildcards the l3 (and l4) types when >> + * translating a ct() action. This is already done in different >> + * actions and since both the userspace and the kernel datapath >> + * would benefit from it, it seems an appropriate place to do >> + * it. >> + * >> + * --- >> + * [1] A simple benchmark (running only the connection tracker >> + * over and over on the same packets) shows that if the >> + * l3 type is already provided we are 15% faster (running the >> + * connection tracker over a couple of DPDK devices with a >> + * stream of UDP 64-bytes packets shows that we are 4% faster). >> + * >> + * [2] The reasons for this are that keeping the flow increases >> + * (slightly) the cache footprint and increases computation >> + * time as we move the packet around. Most importantly, the flow >> + * should be updated by the actions and this can be slow, as >> + * we use a sparse representation (miniflow). >> + * >> + */ > >I discussed the above offline with Jarno; we will always unwildcard >the l3 -- see xlate_wc_init(). Oh, you're right, I forgot about that, I guess I need at least to update the comment. > >Do we have a benchmark on how costly the options are for figuring out >the l4 type? It's a problem only for IPv6, as we need to parse the extension headers. None of the solution I thought about seem particularly clean. I'd be inclined to keep this here and revisit it later. > > >> +static void >> +conn_key_lookup(struct conntrack *ct, >> + struct conn_lookup_ctx *ctx, >> + unsigned bucket, >> + long long now) >> +{ >> + struct conn *conn, *found = NULL; >> + uint32_t hash = ctx->hash; >> + bool reply; >> + >> + HMAP_FOR_EACH_WITH_HASH (conn, node, hash, &ct->connections[bucket]) { >> + if (!memcmp(&conn->key, &ctx->key, sizeof(conn->key))) { >> + found = conn; >> + reply = false; >> + break; >> + } >> + if (!memcmp(&conn->rev_key, &ctx->key, sizeof(conn->rev_key))) { >> + found = conn; >> + reply = true; >> + break; >> + } >> + } >> + >> + if (found) { >> + if (conn_expired(found, now)) { >> + found = NULL; >> + } else { >> + ctx->reply = reply; >> + } >> + } > >Should this check be within the HMAP_FOR_EACH_WITH_HASH? Yes, that also simplify the code a little bit, thank you. > > >Separate question, and maybe something to be followed up on separately >from this series, today with the current userspace + kernel >implementation we can install a flow like: > >arp,actions=ct(table=1) > >At OpenFlow layer we'll accept this (although maybe we shouldn't)... >When processing an upcall, we'll translate this (though we have enough >information to at least complain if not more) >Then when installing/executing into the kernel datapath, we'll reject >the command while parsing the action. > >I'm considering that maybe we should either a) Restrict this all the >way up at OpenFlow, or b) loosen the restriction in kernel datapath >and allow such flows (but mark the traffic as invalid early in >conntrack processing and skip to the next action). > >So, to my question: How does the userspace handle a case like this? The userspace datapath does what you're suggesting as b): in conntrack_execute() if the packet is not ipv4 or ipv6 the packet is marked as invalid (the key extraction fails). Restricting this at the OpenFlow level means that it wouldn't be possible to a have _single_ flow to send both ipv4 and ipv6 traffic to the connection tracker. Do we care about that? _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev