On 10 September 2015 at 19:03, Jarno Rajahalme <jrajaha...@nicira.com> wrote:
>
>> On Sep 9, 2015, at 7:00 PM, Joe Stringer <joestrin...@nicira.com> wrote:
>> @@ -672,6 +705,8 @@ struct ovs_action_push_tnl {
>>  * indicate the new packet contents. This could potentially still be
>>  * %ETH_P_MPLS if the resulting MPLS label stack is not empty.  If there
>>  * is no MPLS label stack, as determined by ethertype, no action is taken.
>> + * @OVS_ACTION_ATTR_CT: Track the connection. Populate the conntrack-related
>> + * entries in the flow key.
>>  *
>>  * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not 
>> all
>>  * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
>> @@ -702,6 +737,7 @@ enum ovs_action_attr {
>>                                      * data immediately followed by a mask.
>>                                      * The data must be zero for the unmasked
>>                                      * bits. */
>> +     OVS_ACTION_ATTR_CT,           /* One nested OVS_CT_ATTR_* . */
>>
>
> I guess this should be “Nested OVS_CT_ATTR_*”, as there can be multiple 
> nested attributes?

Right you are, I'll follow up with this on upstream too.


>> diff --git a/lib/flow.h b/lib/flow.h
>> index d8632ff..2d6d3ee 100644
>> --- a/lib/flow.h
>> +++ b/lib/flow.h
>> @@ -103,8 +103,10 @@ struct flow {
>>     union flow_in_port in_port; /* Input port.*/
>>     uint32_t recirc_id;         /* Must be exact match. */
>>     uint32_t conj_id;           /* Conjunction ID. */
>> +    uint16_t ct_zone;           /* Connection Zone. */
>> +    uint8_t ct_state;           /* Connection state. */
>> +    uint8_t pad1[3];            /* Pad to 64 bits. */
>>     ofp_port_t actset_output;   /* Output port in action set. */
>> -    uint8_t pad1[6];            /* Pad to 64 bits. */
>>
>
> It would be slightly better to put ct_zone and ct_state right after 
> recirc_id. This would make miniflows smaller, as the recirc_id, ct_zone, and 
> ct_state would all be in the same miniflow data unit (conj_id and 
> actset_output are userspace-only fields, so they would be zeroes for datapath 
> miniflows).
>
> We should probably make ct_state a 16 bit field (uint16_t) in struct flow 
> already, for future-proofness, even if the kernel interface is still 8 bits.
>
> Also, you should bump FLOW_WC_SEQ to make sure all the relevant places are 
> updated for the new fields.

OK, I'll adjust these.

>> --- a/lib/meta-flow.h
>> +++ b/lib/meta-flow.h
>> @@ -703,6 +703,51 @@ enum OVS_PACKED_ENUM mf_field_id {
>>      */
>>     MFF_PKT_MARK,
>>
>> +    /* "ct_state".
>> +     *
>> +     * Connection tracking state.  The field is populated by the NXAST_CT
>> +     * action. The following bit values describe the state of the 
>> connection:
>> +     *
>> +     *   - New (0x01): This is the beginning of a new connection.
>> +     *   - Established (0x02): This is part of an already existing 
>> connection.
>> +     *   - Related (0x04): This is a separate connection that is related to 
>> an
>> +     *                     existing connection.
>> +     *   - Invalid (0x20): This flow could not be associated with a 
>> connection.
>> +     *                     This could be set for a variety of reasons,
>> +     *                     including (but not limited to):
>> +     *                     - L3/L4 protocol handler is not 
>> loaded/unavailable.
>> +     *                     - L3/L4 protocol handler determines that the 
>> packet
>> +     *                       is malformed or invalid for the current FSM 
>> stage.
>> +     *                     - Packets are unexpected length for protocol.
>> +     *   - Reply (0x40): This flow is in the reply direction, ie it did not
>> +     *                   initiate the connection.
>> +     *   - Tracked (0x80): Connection tracking has occurred.
>> +     *

This should also mention that "related" catches related packets such
as ICMP destination unreachable messages (this is "related" to the
connection which initiated the request to the unreachable
destination). I'll fix that up, here and in the manual.

> Have we documented anywhere which of the bits are mutually exclusive and how. 
> E.g.,
> - If -trk, all the other bits are also zeroes (also invalid?)
> - new and established are mutually exclusive (+new+est is not possible)

Right.

> - new and reply are mutually exclusive (reply direction is defined as 
> opposite of the initial +new)

Strictly speaking I don't think this is the case. I don't have an
example, but it is possible for this to be set in the conntrack info
specified in the Linux interface, see
linux/netfilter/nf_conntrack_common.h. ct_state is translated roughly
from this and potentially other pieces of conntrack information.

> - related and new can be set at the same time (but does related remain set 
> for the duration of the connection, or is it also mutually exclusive with 
> established?)

Related can be present with any other bit. I believe that it remains
set for the duration of the connection. Looks like a good candidate
for another test in the testsuite.

> - apparently related and reply are not mutually exclusive, but when are they 
> set at the same time?

I see "related" as another bit that is independent of all of the
others; whether the connection is related or not, the packets can
still be part of a new connection, established, reply, etc. Perhaps
the more pressing question is exactly what "reply" means in this
context: Is it referring to the direction of the original connection,
or only the related connection? This is something I can further
clarify.

> - are invalid and tracked mutually exclusive?

This question suggests that the description of "tracked" is
insufficient, so I'll amend it. See below for a more comprehensive
definition.

> Also, what are the possible transitions from one state to another?

From the perspective of the packet, there are two states: untracked
and tracked. Untracked means that the packet has not yet been sent
through the connection tracker. If the packet is sent to the
connection tracker via "ct(table=X)", the packet becomes tracked. It
remains tracked until the packet egresses the switch.

From the perspective of connections, connection state can only be
identified once the packet is tracked. There are two connection
states: uncommitted and committed. If a packet has the "new" ct_state
flag, then its corresponding connection is uncommitted, meaning that
the connection tracker cannot identify any other packets as belonging
to the same or related connections. If an uncommitted connection is
committed via "ct(commit)", then subsequent packets for the connection
or related connections may be correlated. In general, any connection
may transition back to uncommitted if no packets are seen for some
time. For specific connection types, such as TCP, there may be
particular packet types that also cause committed connecitons to
transition back to the uncommitted state. The exact operation of these
transitions back to uncommitted are outside the scope of the OpenFlow
definition.

The "new" and "invalid" flags map to the uncommitted connection state.
The "established" and "reply" flags map to the committed connection
state. The "related" flag may be present for connections in either
state.

>> @@ -4448,6 +4455,245 @@ format_DEBUG_RECIRC(const struct ofpact_null *a 
>> OVS_UNUSED, struct ds *s)
>> {
>>     ds_put_cstr(s, "debug_recirc");
>> }
>> +
>> +/* Action structure for NXAST_CT.
>> + *
>> + * Pass traffic to the connection tracker.
>> + *
>> + * The "zone" specifies a context within which the tracking is done:
>> + *
>> + *      If 'zone_src' is nonzero, this specifies that the zone should be
>> + *      sourced from a field zone_src[ofs:ofs+nbits]. The format and 
>> semantics
>> + *      of 'zone_src' and 'zone_ofs_nbits' are similar to those for the
>> + *      NXAST_REG_LOAD action. The acceptable nxm_header values for 
>> 'zone_src'
>> + *      are the same as the acceptable nxm_header values for the 'src' 
>> field of
>> + *      NXAST_REG_MOVE.
>> + *
>> + *      If 'zone_src' is zero, then the value of 'zone_imm' will be used as 
>> the
>> + *      connection tracking zone.
>> + *
>> + * If "recirc_table" has a value other than NX_CT_RECIRC_NONE, then this 
>> field
>> + * specifies that the packet should be logically cloned after the packet is
>
> The packet is logically cloned *before* it is sent through the connection 
> tracker,
> as only the cloned packet gets the ct_state and ct_zone, right?

True. Next time I'll go through and try to ensure that each of the
commit messages, ofp-actions.c and the manpages are consistent on
their descriptions of this.

Strictly speaking, if the table is specified, then it is like an
output action. The pipeline splits prior to the action; one copy goes
through the action and continues to the logical end of that
pipeline(in output this means send the packet and terminate, in ct
this means send through the connection tracker and return to the
OpenFlow pipeline at the table specified). The original packet does
not go through that action, and instead skips straight to the next
action in the list. If it is the last action, then the pipeline
terminates.

>> + * sent through the connection tracker, and pipeline processing for the 
>> cloned
>> + * packet will continue in the OpenFlow table specified by this value. The
>> + * NXM_NX_CT_* fields will be populated for the subsequent processing. It is
>> + * strongly recommended that this table is later than the current table, to
>> + * prevent loops.
>> + *
>
> ‘alg’ and ofpacts are introduced in later patches, and I guess it is too much 
> work to not deal with them here yet?

I could go back and adjust the definition to place padding there and
drop the ofpacts instead.

> Anyway, a thought about the nested actions just popped in my mind: The fact 
> that the actions are nested means that they all must be produced by a single 
> OpenFlow rule. I.e., the same rule that does conntrack must also set the 
> labels, etc. This means that either there will be a potentially huge number 
> of rules with conntrack actions, or the values used must be passed through 
> registers (which is kind of redundant). Also, a register may not be wide 
> enough for the data in question (e.g., labels).

Correct. The idea behind the nested actions is to make the connection
tracking and the modification of the external metadata atomic. Two
alternative approaches had been considered:
- One, set the metadata (ie ct_mark) prior to performing ct() action.
In this case, what we are saying is that the actual execution of this
is deferred until the next ct() action, which could be never. Cases
like "actions=set_field(1->ct_mark),ct(commit,zone=1),ct(commit,zone=2)"
would need to be defined: does the mark modification apply to both
connections, or only the first?
- Another option, set the metadata after performing ct(). This assumes
that the metadata modification applies to the most recent connection
tracking entry, which applies until the end of the pipeline. Either we
defer the ct() action execution until we are sure that we have
gathered all relevant set_field(...ct_*) actions, or we retain a
reference to the connection tracking information just in case a later
action applies the modification. Some interesting cases include, what
happens if "actions=ct(table=1),set_field(1->ct_mark)". It's difficult
to guarantee whether the "ct_mark" for the current connection is
modified before the forked packet re-enters table 1 or afterwards.

While it may require the use of additional registers, I think that
it's clearer which connections the ct_mark applies to in this case,
and dependencies don't need to be tracked along the pipeline.

>> +static enum ofperr
>> +decode_NXAST_RAW_CT(const struct nx_action_conntrack *nac, struct ofpbuf 
>> *out)
>> +{
>> +    const size_t ct_offset = ofpacts_pull(out);
>> +    struct ofpact_conntrack *conntrack;
>> +    struct ofpbuf openflow;
>> +    int error = 0;
>> +
>> +    conntrack = ofpact_put_CT(out);
>> +    conntrack->flags = ntohs(nac->flags);
>> +    error = decode_ct_zone(nac, conntrack);
>> +    if (error) {
>> +        goto out;
>> +    }
>> +    conntrack->recirc_table = nac->recirc_table;
>> +    conntrack->alg = ntohs(nac->alg);
>> +
>> +    ofpbuf_pull(out, sizeof(*conntrack));
>> +
>> +    /* XXX: version */
>
> What about the version?

When decoding the CT action, we have lost the context of which
OpenFlow version we are decoding. So we can't determine whether the
nested actions are valid for the current OpenFlow version. Having
version=OFP10_VERSION has worked in all of the cases I have tried so
far, perhaps because the nicira extensions extend all openflow
versions. I'm not sure if there's a more correct way to apply this, or
if passing OFP10_VERSION is sufficient here. Thoughts?

>> +    nac->recirc_table = conntrack->recirc_table;
>> +    nac->alg = htons(conntrack->alg);
>> +
>> +    len = ofpacts_put_openflow_actions(conntrack->actions,
>> +                                       ofpact_ct_get_action_len(conntrack),
>> +                                       out, ofp_version);
>> +    len += sizeof(*nac);
>> +    nac = ofpbuf_at(out, ofs, sizeof(*nac));
>> +    nac->len = htons(len);
>
> nac->len should include the conntrack action itself, too, not just the length 
> of the actions?

You mean like, "len += sizeof(*nac)"? Am I missing something here?

>> +/* Bits for 'flags' in struct nx_action_conntrack.
>> + *
>> + * If NX_CT_F_COMMIT is set, then the connection entry is moved from the
>> + * unconfirmed to confirmed list in the tracker. */
>> +enum nx_conntrack_flags {
>> +    NX_CT_F_COMMIT = 1 << 0,
>> +};
>> +
>> +/* Magic value for struct nx_action_conntrack 'recirc_table' field, to 
>> specify
>> + * that the packet should not be recirculated. This value should commonly be
>> + * used in conjunction with the NX_CT_F_COMMIT flag above. */
>> +#define NX_CT_RECIRC_NONE 0xFF
>> +
>
> Maybe this should be defined as OFPTT_ALL, as that is the only invalid table 
> number (0xff), i.e., NX_CT_RECIRC_NONE may not be defined as anything else, 
> right?

Agree, I'll change this.

>> +#define CT_MEMBERS                      \
>> +    uint16_t flags;                     \
>> +    uint16_t zone_imm;                  \
>> +    struct mf_subfield zone_src;        \
>> +    uint8_t recirc_table;               \
>> +    uint16_t alg
>> +
>
> Maybe put the struct member first to avoid internal padding (mf_subfield has 
> a pointer, so it will be 64-bit aligned on 64-bit systems)?

Sure.

>> +/* Used to perform build-time size checking for ofpact_conntrack. */
>> +struct ofpact_ct_size {
>> +    CT_MEMBERS;
>> +};
>> +
>> +/* OFPACT_CT.
>> + *
>> + * Used for NXAST_CT. */
>> +struct ofpact_conntrack {
>> +    struct ofpact ofpact;
>> +    CT_MEMBERS;
>
> This is slightly ugly, seems to be necessary.

Right, I'm not sure that we can calculate the correct padding size
without a trick like this.

>> bool dpid_from_string(const char *s, uint64_t *dpidp);
>> @@ -710,6 +717,18 @@ struct tcp_header {
>> };
>> BUILD_ASSERT_DECL(TCP_HEADER_LEN == sizeof(struct tcp_header));
>>
>> +/* Connection states */
>> +#define CS_NEW               0x01
>> +#define CS_ESTABLISHED       0x02
>> +#define CS_RELATED           0x04
>> +#define CS_INVALID           0x20
>> +#define CS_REPLY_DIR         0x40
>> +#define CS_TRACKED           0x80
>> +
>
> These seem to match the datapath interface flags. Is that an invariant? Which 
> ones we use in struct flow?

This is "coincidental". Strictly speaking the OpenFlow wire format for
these flags should be translated into an internal representation,
which should be subsequently translated into the equivalent datapath
representation, but in practice they're all the same. Perhaps this
would be more explicit to define them using the netlink interface
definitions:

#define CS_NEW    OVS_CS_F_NEW
....

>> diff --git a/ofproto/ofproto-dpif-rid.h b/ofproto/ofproto-dpif-rid.h
>> index e7d95bf..7754e68 100644
>> --- a/ofproto/ofproto-dpif-rid.h
>> +++ b/ofproto/ofproto-dpif-rid.h
>> @@ -142,6 +142,7 @@ struct recirc_state {
>>     struct recirc_metadata metadata; /* Flow metadata. */
>>     struct ofpbuf *stack;         /* Stack if any. */
>>     mirror_mask_t mirrors;        /* Mirrors already output. */
>> +    bool conntrack;               /* Conntrack occurred prior to recirc. */
>>
>
> Should this be also called ‘conntracked’?

OK.

>> static enum ofperr
>> +rule_check(struct rule *rule)
>> +{
>> +    struct ofproto_dpif *ofproto = ofproto_dpif_cast(rule->ofproto);
>> +    const struct odp_support *support;
>> +    struct match match;
>> +
>> +    minimatch_expand(&rule->cr.match, &match);
>> +    support = &ofproto_dpif_get_support(ofproto)->odp;
>> +
>> +    if ((match.wc.masks.ct_state && !support->ct_state)
>> +        || (match.wc.masks.ct_zone && !support->ct_zone)) {
>> +        return OFPERR_OFPBMC_BAD_FIELD;
>> +    }
>> +    if (match.wc.masks.ct_state & CS_UNSUPPORTED_MASK) {
>> +        return OFPERR_OFPBMC_BAD_MASK;
>> +    }
>> +
>> +    return 0;
>> +}
>
> This check seems somewhat expensive. Maybe get the masks from the minimatch 
> without expanding it (using MINIFLOW_GET_U8() and MINIFLOW_GET_U16()), like 
> this:
>
> static enum ofperr
> rule_check(struct rule *rule)
> {
>     ct_state_mask = MINIFLOW_GET_U8(&rule->cr.match.mask.masks, ct_state);
>     ct_zone_mask = MINIFLOW_GET_U16(&rule->cr.match.mask.masks, ct_zone);
>
>     if (ct_state_mask || ct_zone_mask) {
>         struct ofproto_dpif *ofproto = ofproto_dpif_cast(rule->ofproto);
>         const struct odp_support *support = 
> &ofproto_dpif_get_support(ofproto)->odp;
>
>         if ((ct_state_mask && !support->ct_state)
>             || (ct_zone_mask && !support->ct_zone)) {
>             return OFPERR_OFPBMC_BAD_FIELD;
>         }
>         if (ct_state_mask & CS_UNSUPPORTED_MASK) {
>             return OFPERR_OFPBMC_BAD_MASK;
>         }
>     }
>     return 0;
> }

Thanks for the suggestion, looks reasonable.

>> --- a/tests/system-traffic.at
>> +++ b/tests/system-traffic.at
>> @@ -139,3 +139,472 @@ NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 
>> -w 2 10.1.1.100 | FORMAT_PI
>>
>> OVS_TRAFFIC_VSWITCHD_STOP
>> AT_CLEANUP
>> +
>> +AT_SETUP([conntrack - controller])
>> +CHECK_CONNTRACK()
>> +OVS_TRAFFIC_VSWITCHD_START(
>> +   [set-fail-mode br0 standalone -- ])
>> +
>> +ADD_NAMESPACES(at_ns0, at_ns1)
>> +
>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>> +
>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from 
>> ns1->ns0.
>> +AT_DATA([flows.txt], [dnl
>> +priority=1,action=drop
>> +priority=10,arp,action=normal
>> +in_port=1,udp,action=ct(commit),controller
>> +in_port=2,ct_state=-trk,udp,action=ct(table=0)
>> +in_port=2,ct_state=+trk+est-new,udp,action=controller
>
> I forgot the default priority value, would be useful to specify the priority 
> of all the flows to make this more readable.

OK.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Reply via email to