Re: [ovs-dev] [PATCH v3 1/2] ofproto: Allow xlate_actions() to fail.

Jarno Rajahalme Wed, 25 Nov 2015 15:34:19 -0800

> On Nov 25, 2015, at 3:14 PM, Joe Stringer <j...@ovn.org> wrote:
> 
> On 25 November 2015 at 15:12, Joe Stringer <joestrin...@nicira.com> wrote:
>> On 25 November 2015 at 15:06, Jarno Rajahalme <ja...@ovn.org> wrote:
>>> 
>>>> On Nov 25, 2015, at 2:58 PM, Joe Stringer <j...@ovn.org> wrote:
>>>> 
>>>> On 25 November 2015 at 11:23, Jarno Rajahalme <ja...@ovn.org 
>>>> <mailto:ja...@ovn.org>> wrote:
>>>>> 
>>>>> On Nov 25, 2015, at 11:11 AM, Jarno Rajahalme <ja...@ovn.org> wrote:
>>>>> 
>>>>> 
>>>>> On Nov 25, 2015, at 10:52 AM, Joe Stringer <j...@ovn.org> wrote:
>>>>> 
>>>>> On 25 November 2015 at 10:31, Jarno Rajahalme <ja...@ovn.org> wrote:
>>>>> 
>>>>> 
>>>>> On Nov 24, 2015, at 5:02 PM, Joe Stringer <j...@ovn.org> wrote:
>>>>> 
>>>>> On 24 November 2015 at 13:41, Jarno Rajahalme <ja...@ovn.org> wrote:
>>>>> 
>>>>> Sometimes xlate_actions() fails due to too deep recursion, too many
>>>>> MPLS labels, or missing recirculation context.  Make xlate_actions()
>>>>> clear out the produced odp actions in these cases to make it easy for
>>>>> the caller to install a drop flow (instead or installing a flow with
>>>>> partially translated actions).  Also, return a specific error code, so
>>>>> that the error can be properly propagated where meaningful.
>>>>> 
>>>>> Before this patch it was possible that the revalidation installed a
>>>>> flow with a recirculation ID with an invalid recirc ID (== 0), due to
>>>>> the introduction of in-place modification in commit 43b2f131a229
>>>>> (ofproto: Allow in-place modifications of datapath flows).
>>>>> 
>>>>> Signed-off-by: Jarno Rajahalme <ja...@ovn.org>
>>>>> 
>>>>> 
>>>>> Should this also set the error when receiving packets on a mirror port
>>>>> in xlate_actions()? Or when receiving tagged VLAN traffic that doesn't
>>>>> correspond to the port's vlan tag? Or when a group has no live bucket?
>>>>> Are there any other cases that should also be covered? (I just scanned
>>>>> across ofproto/ofproto-dpif-xlate.c looking for cases where we're
>>>>> already logging that we drop the packet, but maybe there's a reasoning
>>>>> behind not including these - if so, please enlighten me)
>>>>> 
>>>>> 
>>>>> No reasoning for missing those, I just did not notice them. Thanks for
>>>>> pointing them out.
>>>>> 
>>>>> 
>>>>> OK, I thought it may have been something like "expected errors" vs.
>>>>> "unexpected errors".
>>>>> 
>>>>> 
>>>>> Looking into these I noticed this to be the case. Must discern whether to
>>>>> fail just the individual action v.s. the whole pipeline.
>>>>> 
>>>>> 
>>>>> How about this incremental to cover two cases here (rest are “expected
>>>>> errors” IMO):
>>>>> 
>>>>> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
>>>>> index 36a6fbc..2908339 100644
>>>>> --- a/ofproto/ofproto-dpif-xlate.c
>>>>> +++ b/ofproto/ofproto-dpif-xlate.c
>>>>> @@ -336,6 +336,10 @@ const char *xlate_strerror(enum xlate_error error)
>>>>>        return "Recirculation conflict";
>>>>>    case XLATE_TOO_MANY_MPLS_LABELS:
>>>>>        return "Too many MPLS labels";
>>>>> +    case XLATE_BUCKET_CHAINING_TOO_DEEP:
>>>>> +        return "Bucket chaining too deep";
>>>>> +    case XLATE_NO_INPUT_BUNDLE:
>>>>> +        return "No input bundle";
>>>>>    }
>>>>>    return "Unknown error";
>>>>> }
>>>>> @@ -1444,10 +1448,9 @@ bucket_is_alive(const struct xlate_ctx *ctx,
>>>>>                struct ofputil_bucket *bucket, int depth)
>>>>> {
>>>>>    if (depth >= MAX_LIVENESS_RECURSION) {
>>>>> -        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
>>>>> -
>>>>> -        VLOG_WARN_RL(&rl, "bucket chaining exceeded %d links",
>>>>> -                     MAX_LIVENESS_RECURSION);
>>>>> +        XLATE_REPORT_ERROR(ctx, "bucket chaining exceeded %d links",
>>>>> +                           MAX_LIVENESS_RECURSION);
>>>>> +        ctx->error = XLATE_BUCKET_CHAINING_TOO_DEEP;
>>>>>        return false;
>>>>>    }
>>>>> 
>>>>> @@ -2323,7 +2326,8 @@ xlate_normal(struct xlate_ctx *ctx)
>>>>>    in_xbundle = lookup_input_bundle(ctx->xbridge, flow->in_port.ofp_port,
>>>>>                                     ctx->xin->packet != NULL, &in_port);
>>>>>    if (!in_xbundle) {
>>>>> -        xlate_report(ctx, "no input bundle, dropping");
>>>>> +        XLATE_REPORT_ERROR(ctx, "no input bundle, dropping");
>>>>> +        ctx->error = XLATE_NO_INPUT_BUNDLE;
>>>>>        return;
>>>>>    }
>>>>> 
>>>>> 
>>>>> The last one is debatable, as setting the error fails the whole 
>>>>> translation
>>>>> rather than just the normal action. But this is most likely an 
>>>>> configuration
>>>>> error, so maybe failing the whole pipeline (and installing a drop flow) is
>>>>> the right thing to do here?
>>>> 
>>>> Jarno and I discussed this offline, and I'll try to summarise here.
>>>> Broadly speaking, we're talking about the decision between failing an
>>>> individual (piece of an) action or completely failing the action
>>>> processing for the flow. And I think arguably the approach should be
>>>> that if it is a serious error such as running out of resources or an
>>>> internal conflict of recirc IDs, then we should fail the entire action
>>>> processing. In this case it will have two user-visible effects:
>>>> 1) ofproto/trace will tell the user which serious condition is being
>>>> triggered that causes dropping of the flow
>>>> 2) OpenFlow controllers attempting packet_out could be notified that
>>>> the error occurred (rather than silently failing like currently)
>>>> 
>>>> However, in the two cases in the incremental patch here, the actions
>>>> inherently have some ambiguity as to whether they successfully execute
>>>> (eg output) or not. The more obvious case is in the bucket_is_alive()
>>>> logic, where recursion will cause a bucket to be not used. If a bucket
>>>> is not live in the spec, this doesn't mean that the entire flow should
>>>> stop processing. In the case of normal, I'd argue it's very similar in
>>>> that 'normal' doesn't specifically attempt to output to a particular
>>>> port; sending packets out to different ports may fail for different
>>>> reasons, but this shouldn't prevent later actions in the actions list
>>>> from being executed.
>>>> 
>>>> I think the latter cases should be reported for ofproto/trace, though.
>>>> 
>>>> Looking back across this thread, it looks not far off your reasoning
>>>> described earlier so I think we're converging on the same view. Does
>>>> this sound like a fair approach?
>>>> 
>>>> --
>>>> 
>>>> In the mirror case, the point is moot because do_xlate_actions() isn't
>>>> even called in that case, so it's purely a matter of whether we want
>>>> to return the error up the stack or not. Maybe that should be reported
>>>> for ofproto/trace as well.
>>>> 
>>>> I didn't see any other cases that might need handling through this.
>>> 
>>> So the only ask I see here is that more of the cases of individual actions 
>>> bailing out should have xlate_report() calls on them. To me this sounds 
>>> like a different patch, not directly related to erroring out the whole 
>>> translation. As such I hope to get an Ack on the original patch of this now 
>>> lengthy discussion…
>> 
>> I agree.
>> 
>> Acked-by: Joe Stringer <j...@ovn.org>
> 
> Ah, mismatch between my sender email an the ack. Here:
> 
> Acked-by: Joe Stringer <j...@ovn.org>


Thanks for the thorough review, pushed to master.

  Jarno


_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Re: [ovs-dev] [PATCH v3 1/2] ofproto: Allow xlate_actions() to fail.

Reply via email to