> On Sep 11, 2015, at 1:07 PM, Joe Stringer <joestrin...@nicira.com> wrote:
> 
> On 10 September 2015 at 19:03, Jarno Rajahalme <jrajaha...@nicira.com> wrote:
>> 
>>> On Sep 9, 2015, at 7:00 PM, Joe Stringer <joestrin...@nicira.com> wrote:
>>> @@ -672,6 +705,8 @@ struct ovs_action_push_tnl {
>>> * indicate the new packet contents. This could potentially still be
>>> * %ETH_P_MPLS if the resulting MPLS label stack is not empty.  If there
>>> * is no MPLS label stack, as determined by ethertype, no action is taken.
>>> + * @OVS_ACTION_ATTR_CT: Track the connection. Populate the 
>>> conntrack-related
>>> + * entries in the flow key.
>>> *
>>> * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not 
>>> all
>>> * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
>>> @@ -702,6 +737,7 @@ enum ovs_action_attr {
>>>                                     * data immediately followed by a mask.
>>>                                     * The data must be zero for the unmasked
>>>                                     * bits. */
>>> +     OVS_ACTION_ATTR_CT,           /* One nested OVS_CT_ATTR_* . */
>>> 
>> 
>> I guess this should be “Nested OVS_CT_ATTR_*”, as there can be multiple 
>> nested attributes?
> 
> Right you are, I'll follow up with this on upstream too.
> 
> 
>>> diff --git a/lib/flow.h b/lib/flow.h
>>> index d8632ff..2d6d3ee 100644
>>> --- a/lib/flow.h
>>> +++ b/lib/flow.h
>>> @@ -103,8 +103,10 @@ struct flow {
>>>    union flow_in_port in_port; /* Input port.*/
>>>    uint32_t recirc_id;         /* Must be exact match. */
>>>    uint32_t conj_id;           /* Conjunction ID. */
>>> +    uint16_t ct_zone;           /* Connection Zone. */
>>> +    uint8_t ct_state;           /* Connection state. */
>>> +    uint8_t pad1[3];            /* Pad to 64 bits. */
>>>    ofp_port_t actset_output;   /* Output port in action set. */
>>> -    uint8_t pad1[6];            /* Pad to 64 bits. */
>>> 
>> 
>> It would be slightly better to put ct_zone and ct_state right after 
>> recirc_id. This would make miniflows smaller, as the recirc_id, ct_zone, and 
>> ct_state would all be in the same miniflow data unit (conj_id and 
>> actset_output are userspace-only fields, so they would be zeroes for 
>> datapath miniflows).
>> 
>> We should probably make ct_state a 16 bit field (uint16_t) in struct flow 
>> already, for future-proofness, even if the kernel interface is still 8 bits.
>> 
>> Also, you should bump FLOW_WC_SEQ to make sure all the relevant places are 
>> updated for the new fields.
> 
> OK, I'll adjust these.
> 
>>> --- a/lib/meta-flow.h
>>> +++ b/lib/meta-flow.h
>>> @@ -703,6 +703,51 @@ enum OVS_PACKED_ENUM mf_field_id {
>>>     */
>>>    MFF_PKT_MARK,
>>> 
>>> +    /* "ct_state".
>>> +     *
>>> +     * Connection tracking state.  The field is populated by the NXAST_CT
>>> +     * action. The following bit values describe the state of the 
>>> connection:
>>> +     *
>>> +     *   - New (0x01): This is the beginning of a new connection.
>>> +     *   - Established (0x02): This is part of an already existing 
>>> connection.
>>> +     *   - Related (0x04): This is a separate connection that is related 
>>> to an
>>> +     *                     existing connection.
>>> +     *   - Invalid (0x20): This flow could not be associated with a 
>>> connection.
>>> +     *                     This could be set for a variety of reasons,
>>> +     *                     including (but not limited to):
>>> +     *                     - L3/L4 protocol handler is not 
>>> loaded/unavailable.
>>> +     *                     - L3/L4 protocol handler determines that the 
>>> packet
>>> +     *                       is malformed or invalid for the current FSM 
>>> stage.
>>> +     *                     - Packets are unexpected length for protocol.
>>> +     *   - Reply (0x40): This flow is in the reply direction, ie it did not
>>> +     *                   initiate the connection.
>>> +     *   - Tracked (0x80): Connection tracking has occurred.
>>> +     *
> 
> This should also mention that "related" catches related packets such
> as ICMP destination unreachable messages (this is "related" to the
> connection which initiated the request to the unreachable
> destination). I'll fix that up, here and in the manual.
> 
>> Have we documented anywhere which of the bits are mutually exclusive and 
>> how. E.g.,
>> - If -trk, all the other bits are also zeroes (also invalid?)
>> - new and established are mutually exclusive (+new+est is not possible)
> 
> Right.
> 
>> - new and reply are mutually exclusive (reply direction is defined as 
>> opposite of the initial +new)
> 
> Strictly speaking I don't think this is the case. I don't have an
> example, but it is possible for this to be set in the conntrack info
> specified in the Linux interface, see
> linux/netfilter/nf_conntrack_common.h. ct_state is translated roughly
> from this and potentially other pieces of conntrack information.
> 

While IP_CT_NEW_REPLY is defined, it is only used by openvswitch/connntrack.c, 
and the next line says “(no NEW in reply dirn.)”

So it looks like IP_CT_NEW_REPLY should not even be defined. I have prepared a 
patch to remove the IP_CT_NEW_REPLY definition, but have no immediate plans to 
send it out.

As said, it seems to me that reply direction is defined as the opposing 
direction to the packet that initially created the conntrack entry. I’m still a 
bit confused about related vs. related_reply, though.

>> - related and new can be set at the same time (but does related remain set 
>> for the duration of the connection, or is it also mutually exclusive with 
>> established?)
> 
> Related can be present with any other bit. I believe that it remains
> set for the duration of the connection. Looks like a good candidate
> for another test in the testsuite.
> 

ctinfo enum can not hold related and established at the same time, though.

>> - apparently related and reply are not mutually exclusive, but when are they 
>> set at the same time?
> 
> I see "related" as another bit that is independent of all of the
> others; whether the connection is related or not, the packets can
> still be part of a new connection, established, reply, etc. Perhaps
> the more pressing question is exactly what "reply" means in this
> context: Is it referring to the direction of the original connection,
> or only the related connection? This is something I can further
> clarify.
> 

in ctinfo enum related, new, and established are all mutually exclusive. In OVS 
ct_state this is not always the case, though, as we explicitly set the new bit 
for related, too. I agree that it would be nice to be able to persist the 
related bit also when the connection state is “established”.

>> - are invalid and tracked mutually exclusive?
> 
> This question suggests that the description of "tracked" is
> insufficient, so I'll amend it. See below for a more comprehensive
> definition.
> 
>> Also, what are the possible transitions from one state to another?
> 
> From the perspective of the packet, there are two states: untracked
> and tracked. Untracked means that the packet has not yet been sent
> through the connection tracker. If the packet is sent to the
> connection tracker via "ct(table=X)", the packet becomes tracked. It
> remains tracked until the packet egresses the switch.
> 
> From the perspective of connections, connection state can only be
> identified once the packet is tracked. There are two connection
> states: uncommitted and committed. If a packet has the "new" ct_state
> flag, then its corresponding connection is uncommitted, meaning that
> the connection tracker cannot identify any other packets as belonging
> to the same or related connections. If an uncommitted connection is
> committed via "ct(commit)", then subsequent packets for the connection
> or related connections may be correlated. In general, any connection
> may transition back to uncommitted if no packets are seen for some
> time. For specific connection types, such as TCP, there may be
> particular packet types that also cause committed connecitons to
> transition back to the uncommitted state. The exact operation of these
> transitions back to uncommitted are outside the scope of the OpenFlow
> definition.
> 
> The "new" and "invalid" flags map to the uncommitted connection state.
> The "established" and "reply" flags map to the committed connection
> state. The "related" flag may be present for connections in either
> state.
> 
>>> @@ -4448,6 +4455,245 @@ format_DEBUG_RECIRC(const struct ofpact_null *a 
>>> OVS_UNUSED, struct ds *s)
>>> {
>>>    ds_put_cstr(s, "debug_recirc");
>>> }
>>> +
>>> +/* Action structure for NXAST_CT.
>>> + *
>>> + * Pass traffic to the connection tracker.
>>> + *
>>> + * The "zone" specifies a context within which the tracking is done:
>>> + *
>>> + *      If 'zone_src' is nonzero, this specifies that the zone should be
>>> + *      sourced from a field zone_src[ofs:ofs+nbits]. The format and 
>>> semantics
>>> + *      of 'zone_src' and 'zone_ofs_nbits' are similar to those for the
>>> + *      NXAST_REG_LOAD action. The acceptable nxm_header values for 
>>> 'zone_src'
>>> + *      are the same as the acceptable nxm_header values for the 'src' 
>>> field of
>>> + *      NXAST_REG_MOVE.
>>> + *
>>> + *      If 'zone_src' is zero, then the value of 'zone_imm' will be used 
>>> as the
>>> + *      connection tracking zone.
>>> + *
>>> + * If "recirc_table" has a value other than NX_CT_RECIRC_NONE, then this 
>>> field
>>> + * specifies that the packet should be logically cloned after the packet is
>> 
>> The packet is logically cloned *before* it is sent through the connection 
>> tracker,
>> as only the cloned packet gets the ct_state and ct_zone, right?
> 
> True. Next time I'll go through and try to ensure that each of the
> commit messages, ofp-actions.c and the manpages are consistent on
> their descriptions of this.
> 
> Strictly speaking, if the table is specified, then it is like an
> output action. The pipeline splits prior to the action; one copy goes
> through the action and continues to the logical end of that
> pipeline(in output this means send the packet and terminate, in ct
> this means send through the connection tracker and return to the
> OpenFlow pipeline at the table specified). The original packet does
> not go through that action, and instead skips straight to the next
> action in the list. If it is the last action, then the pipeline
> terminates.
> 
>>> + * sent through the connection tracker, and pipeline processing for the 
>>> cloned
>>> + * packet will continue in the OpenFlow table specified by this value. The
>>> + * NXM_NX_CT_* fields will be populated for the subsequent processing. It 
>>> is
>>> + * strongly recommended that this table is later than the current table, to
>>> + * prevent loops.
>>> + *
>> 
>> ‘alg’ and ofpacts are introduced in later patches, and I guess it is too 
>> much work to not deal with them here yet?
> 
> I could go back and adjust the definition to place padding there and
> drop the ofpacts instead.

Seems like a lot of work, though, essentially only for reviewing purposes. Up 
to you :-)

> 
>> Anyway, a thought about the nested actions just popped in my mind: The fact 
>> that the actions are nested means that they all must be produced by a single 
>> OpenFlow rule. I.e., the same rule that does conntrack must also set the 
>> labels, etc. This means that either there will be a potentially huge number 
>> of rules with conntrack actions, or the values used must be passed through 
>> registers (which is kind of redundant). Also, a register may not be wide 
>> enough for the data in question (e.g., labels).
> 
> Correct. The idea behind the nested actions is to make the connection
> tracking and the modification of the external metadata atomic. Two
> alternative approaches had been considered:
> - One, set the metadata (ie ct_mark) prior to performing ct() action.
> In this case, what we are saying is that the actual execution of this
> is deferred until the next ct() action, which could be never. Cases
> like "actions=set_field(1->ct_mark),ct(commit,zone=1),ct(commit,zone=2)"
> would need to be defined: does the mark modification apply to both
> connections, or only the first?
> - Another option, set the metadata after performing ct(). This assumes
> that the metadata modification applies to the most recent connection
> tracking entry, which applies until the end of the pipeline. Either we
> defer the ct() action execution until we are sure that we have
> gathered all relevant set_field(...ct_*) actions, or we retain a
> reference to the connection tracking information just in case a later
> action applies the modification. Some interesting cases include, what
> happens if "actions=ct(table=1),set_field(1->ct_mark)". It's difficult
> to guarantee whether the "ct_mark" for the current connection is
> modified before the forked packet re-enters table 1 or afterwards.
> 
> While it may require the use of additional registers, I think that
> it's clearer which connections the ct_mark applies to in this case,
> and dependencies don't need to be tracked along the pipeline.
> 
>>> +static enum ofperr
>>> +decode_NXAST_RAW_CT(const struct nx_action_conntrack *nac, struct ofpbuf 
>>> *out)
>>> +{
>>> +    const size_t ct_offset = ofpacts_pull(out);
>>> +    struct ofpact_conntrack *conntrack;
>>> +    struct ofpbuf openflow;
>>> +    int error = 0;
>>> +
>>> +    conntrack = ofpact_put_CT(out);
>>> +    conntrack->flags = ntohs(nac->flags);
>>> +    error = decode_ct_zone(nac, conntrack);
>>> +    if (error) {
>>> +        goto out;
>>> +    }
>>> +    conntrack->recirc_table = nac->recirc_table;
>>> +    conntrack->alg = ntohs(nac->alg);
>>> +
>>> +    ofpbuf_pull(out, sizeof(*conntrack));
>>> +
>>> +    /* XXX: version */
>> 
>> What about the version?
> 
> When decoding the CT action, we have lost the context of which
> OpenFlow version we are decoding. So we can't determine whether the
> nested actions are valid for the current OpenFlow version. Having
> version=OFP10_VERSION has worked in all of the cases I have tried so
> far, perhaps because the nicira extensions extend all openflow
> versions. I'm not sure if there's a more correct way to apply this, or
> if passing OFP10_VERSION is sufficient here. Thoughts?
> 

Not sure. Maybe test with each OpenFlow version and see what happens?

>>> +    nac->recirc_table = conntrack->recirc_table;
>>> +    nac->alg = htons(conntrack->alg);
>>> +
>>> +    len = ofpacts_put_openflow_actions(conntrack->actions,
>>> +                                       ofpact_ct_get_action_len(conntrack),
>>> +                                       out, ofp_version);
>>> +    len += sizeof(*nac);
>>> +    nac = ofpbuf_at(out, ofs, sizeof(*nac));
>>> +    nac->len = htons(len);
>> 
>> nac->len should include the conntrack action itself, too, not just the 
>> length of the actions?
> 
> You mean like, "len += sizeof(*nac)"? Am I missing something here?
> 

I simply missed the line you refer to, sorry.

>>> +/* Bits for 'flags' in struct nx_action_conntrack.
>>> + *
>>> + * If NX_CT_F_COMMIT is set, then the connection entry is moved from the
>>> + * unconfirmed to confirmed list in the tracker. */
>>> +enum nx_conntrack_flags {
>>> +    NX_CT_F_COMMIT = 1 << 0,
>>> +};
>>> +
>>> +/* Magic value for struct nx_action_conntrack 'recirc_table' field, to 
>>> specify
>>> + * that the packet should not be recirculated. This value should commonly 
>>> be
>>> + * used in conjunction with the NX_CT_F_COMMIT flag above. */
>>> +#define NX_CT_RECIRC_NONE 0xFF
>>> +
>> 
>> Maybe this should be defined as OFPTT_ALL, as that is the only invalid table 
>> number (0xff), i.e., NX_CT_RECIRC_NONE may not be defined as anything else, 
>> right?
> 
> Agree, I'll change this.
> 
>>> +#define CT_MEMBERS                      \
>>> +    uint16_t flags;                     \
>>> +    uint16_t zone_imm;                  \
>>> +    struct mf_subfield zone_src;        \
>>> +    uint8_t recirc_table;               \
>>> +    uint16_t alg
>>> +
>> 
>> Maybe put the struct member first to avoid internal padding (mf_subfield has 
>> a pointer, so it will be 64-bit aligned on 64-bit systems)?
> 
> Sure.
> 
>>> +/* Used to perform build-time size checking for ofpact_conntrack. */
>>> +struct ofpact_ct_size {
>>> +    CT_MEMBERS;
>>> +};
>>> +
>>> +/* OFPACT_CT.
>>> + *
>>> + * Used for NXAST_CT. */
>>> +struct ofpact_conntrack {
>>> +    struct ofpact ofpact;
>>> +    CT_MEMBERS;
>> 
>> This is slightly ugly, seems to be necessary.
> 
> Right, I'm not sure that we can calculate the correct padding size
> without a trick like this.
> 
>>> bool dpid_from_string(const char *s, uint64_t *dpidp);
>>> @@ -710,6 +717,18 @@ struct tcp_header {
>>> };
>>> BUILD_ASSERT_DECL(TCP_HEADER_LEN == sizeof(struct tcp_header));
>>> 
>>> +/* Connection states */
>>> +#define CS_NEW               0x01
>>> +#define CS_ESTABLISHED       0x02
>>> +#define CS_RELATED           0x04
>>> +#define CS_INVALID           0x20
>>> +#define CS_REPLY_DIR         0x40
>>> +#define CS_TRACKED           0x80
>>> +
>> 
>> These seem to match the datapath interface flags. Is that an invariant? 
>> Which ones we use in struct flow?
> 
> This is "coincidental". Strictly speaking the OpenFlow wire format for
> these flags should be translated into an internal representation,
> which should be subsequently translated into the equivalent datapath
> representation, but in practice they're all the same. Perhaps this
> would be more explicit to define them using the netlink interface
> definitions:
> 
> #define CS_NEW    OVS_CS_F_NEW
> ….

Would the current code work if the mapping was not 1:1? If yes, then it does 
not matter. If not, then we should at least assert that the mapping stays 1:1 
also in the future.

>>> diff --git a/ofproto/ofproto-dpif-rid.h b/ofproto/ofproto-dpif-rid.h
>>> index e7d95bf..7754e68 100644
>>> --- a/ofproto/ofproto-dpif-rid.h
>>> +++ b/ofproto/ofproto-dpif-rid.h
>>> @@ -142,6 +142,7 @@ struct recirc_state {
>>>    struct recirc_metadata metadata; /* Flow metadata. */
>>>    struct ofpbuf *stack;         /* Stack if any. */
>>>    mirror_mask_t mirrors;        /* Mirrors already output. */
>>> +    bool conntrack;               /* Conntrack occurred prior to recirc. */
>>> 
>> 
>> Should this be also called ‘conntracked’?
> 
> OK.
> 
>>> static enum ofperr
>>> +rule_check(struct rule *rule)
>>> +{
>>> +    struct ofproto_dpif *ofproto = ofproto_dpif_cast(rule->ofproto);
>>> +    const struct odp_support *support;
>>> +    struct match match;
>>> +
>>> +    minimatch_expand(&rule->cr.match, &match);
>>> +    support = &ofproto_dpif_get_support(ofproto)->odp;
>>> +
>>> +    if ((match.wc.masks.ct_state && !support->ct_state)
>>> +        || (match.wc.masks.ct_zone && !support->ct_zone)) {
>>> +        return OFPERR_OFPBMC_BAD_FIELD;
>>> +    }
>>> +    if (match.wc.masks.ct_state & CS_UNSUPPORTED_MASK) {
>>> +        return OFPERR_OFPBMC_BAD_MASK;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>> 
>> This check seems somewhat expensive. Maybe get the masks from the minimatch 
>> without expanding it (using MINIFLOW_GET_U8() and MINIFLOW_GET_U16()), like 
>> this:
>> 
>> static enum ofperr
>> rule_check(struct rule *rule)
>> {
>>    ct_state_mask = MINIFLOW_GET_U8(&rule->cr.match.mask.masks, ct_state);
>>    ct_zone_mask = MINIFLOW_GET_U16(&rule->cr.match.mask.masks, ct_zone);
>> 
>>    if (ct_state_mask || ct_zone_mask) {
>>        struct ofproto_dpif *ofproto = ofproto_dpif_cast(rule->ofproto);
>>        const struct odp_support *support = 
>> &ofproto_dpif_get_support(ofproto)->odp;
>> 
>>        if ((ct_state_mask && !support->ct_state)
>>            || (ct_zone_mask && !support->ct_zone)) {
>>            return OFPERR_OFPBMC_BAD_FIELD;
>>        }
>>        if (ct_state_mask & CS_UNSUPPORTED_MASK) {
>>            return OFPERR_OFPBMC_BAD_MASK;
>>        }
>>    }
>>    return 0;
>> }
> 
> Thanks for the suggestion, looks reasonable.
> 
>>> --- a/tests/system-traffic.at
>>> +++ b/tests/system-traffic.at
>>> @@ -139,3 +139,472 @@ NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 
>>> -w 2 10.1.1.100 | FORMAT_PI
>>> 
>>> OVS_TRAFFIC_VSWITCHD_STOP
>>> AT_CLEANUP
>>> +
>>> +AT_SETUP([conntrack - controller])
>>> +CHECK_CONNTRACK()
>>> +OVS_TRAFFIC_VSWITCHD_START(
>>> +   [set-fail-mode br0 standalone -- ])
>>> +
>>> +ADD_NAMESPACES(at_ns0, at_ns1)
>>> +
>>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>>> +
>>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from 
>>> ns1->ns0.
>>> +AT_DATA([flows.txt], [dnl
>>> +priority=1,action=drop
>>> +priority=10,arp,action=normal
>>> +in_port=1,udp,action=ct(commit),controller
>>> +in_port=2,ct_state=-trk,udp,action=ct(table=0)
>>> +in_port=2,ct_state=+trk+est-new,udp,action=controller
>> 
>> I forgot the default priority value, would be useful to specify the priority 
>> of all the flows to make this more readable.
> 
> OK.

Btw. I got the “make check-kernel” run successfully for this patch :-)

  Jarno

_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Reply via email to