Re: [DISCUSS] KIP-298: Error Handling in Connect

Arjun Satish Mon, 21 May 2018 13:26:49 -0700

OK. Let's simplify tolerance to just two values, NONE or ALL. Any
extensions can go through a separate KIP and be implemented in later versions.


Thanks a lot!
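
As a rough illustration of the two-valued tolerance proposed above, here is a
minimal sketch. The names `Tolerance` and `process` are hypothetical and are
not taken from the KIP or from Connect; the sketch only shows the intended
semantics, where NONE fails on the first error and ALL skips every bad record.

```python
from enum import Enum


class Tolerance(Enum):
    """Hypothetical two-valued tolerance setting (illustration only)."""
    NONE = "none"  # any error fails the task
    ALL = "all"    # every failed record is skipped


def process(records, operation, tolerance):
    """Apply `operation` to each record; return (results, skipped records)."""
    results, skipped = [], []
    for record in records:
        try:
            results.append(operation(record))
        except Exception:
            if tolerance is Tolerance.NONE:
                raise  # NONE: the first error stops processing
            skipped.append(record)  # ALL: skip the bad record and continue
    return results, skipped
```

For example, `process(["1", "x", "3"], int, Tolerance.ALL)` returns
`([1, 3], ["x"])`, while the same call with `Tolerance.NONE` raises a
`ValueError` when it reaches "x".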

On Mon, May 21, 2018 at 1:18 PM, Ewen Cheslack-Postava <e...@confluent.io>
wrote:

> On Mon, May 21, 2018 at 12:39 PM Arjun Satish <arjun.sat...@gmail.com>
> wrote:
>
> > Thanks a lot, Ewen! I'll make sure the documentation is clear on the
> > differences between retries and tolerance.
> >
> > Do you think percentage would have the same problem as the one you
> > brought up? Also, if we say 10% tolerance, do we have to wait for the
> > duration to finish before failing the task, or should we fail as soon as
> > we hit 10% errors?
> >
> Yeah, percent might not be right either. I'd probably only reasonably set
> it to 0% or 100% as something in between seems difficult to justify.
>
>
> >
> > Alternatively, do you think making tolerance an Enum would be simpler,
> > with the values NONE (any errors kill), ALL (tolerate all errors and
> > skip records) and FIRST (tolerate the first error, but fail after that)?
> >
>
> I do think the values I would ever use are limited enough to just be an
> enum. Not sure if anyone has use cases for larger positive values.
>
> -Ewen
>
>
> >
> > Best,
> >
> >
> > On Mon, May 21, 2018 at 11:28 AM, Ewen Cheslack-Postava
> > <e...@confluent.io> wrote:
> >
> > > Arjun,
> > >
> > > Understood on retries vs tolerance -- though I suspect this will end up
> > > being a bit confusing to users as well. It's two levels of error
> > > handling which is what tripped me up.
> > >
> > > One last comment on KIP (which otherwise looks good): for the tolerance
> > > setting, do we want it to be an absolute value or something like a
> > > percentage? Given the current way of setting things, I'm not sure I'd
> > > ever set it to anything but -1 or 0, with maybe 1 as an easy option for
> > > restarting a connector to get it past one bad message, then reverting
> > > back to -1 or 0.
> > >
> > > -Ewen
> > >
> > > On Mon, May 21, 2018 at 11:01 AM Arjun Satish <arjun.sat...@gmail.com>
> > > wrote:
> > >
> > > > Hey Jason,
> > > >
> > > > Thanks for your comments. Please find answers inline:
> > > >
> > > > On Mon, May 21, 2018 at 9:52 AM, Jason Gustafson
> > > > <ja...@confluent.io> wrote:
> > > >
> > > > > Hi Arjun,
> > > > >
> > > > > Thanks for the KIP. Just a few comments/questions:
> > > > >
> > > > > 1. The proposal allows users to configure the number of retries. I
> > > > > usually find it easier as a user to work with timeouts since it's
> > > > > difficult to know how long a retry might take. Have you considered
> > > > > adding a timeout option which would retry until the timeout expires?
> > > > >
> > > >
> > > > Good point. Updated the KIP.
> > > >
> > > > > 2. The configs are named very generically (e.g.
> > > > > errors.retries.limit). Do you think it will be clear to users what
> > > > > operations these configs apply to?
> > > > >
> > > >
> > > > As of now, these configs are applicable to all operations in the
> > > > connector pipeline (as mentioned in the proposed changes section). We
> > > > decided not to have per-operation limits because of the additional
> > > > config complexity.
> > > >
> > > >
> > > > > 3. I wasn't entirely clear what messages are stored in the dead
> > > > > letter queue. It sounds like it includes both configs and messages
> > > > > since we have errors.dlq.include.configs? Is there a specific schema
> > > > > you have in mind?
> > > > >
> > > >
> > > > This has been addressed in the KIP. The DLQ will now contain only raw
> > > > messages (no additional context). We are also supporting DLQs only
> > > > for sink connectors now.
> > > >
> > > >
> > > > > 4. I didn't see it mentioned explicitly in the KIP, but I assume
> > > > > the tolerance metrics are reset after every task rebalance?
> > > > >
> > > >
> > > > Great question. Yes, we will reset the tolerance metrics on every
> > > > rebalance.
> > > >
> > > >
> > > > > 5. I wonder if we can do without errors.tolerance.limit. You can
> > > > > get a similar effect using errors.tolerance.rate.limit if you allow
> > > > > longer durations. I'm not sure how useful an absolute counter is in
> > > > > practice.
> > > > >
> > > >
> > > > Yeah, the rate limit does subsume the features offered by the
> > > > absolute counter. Removed it.
> > > >
> > > >
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > > >
> > > >
> > >
> >
>
