Re: [DISCUSS] KIP-298: Error Handling in Connect

Arjun Satish Tue, 22 May 2018 09:56:44 -0700

After some discussions with Jason, we decided that the error handler should
retry an operation only it throws a RetriableException. At the same time,
any Exception in the Transformation and Converter step can be tolerated (by
skipping the record). I updated the table in the Proposed Changes section
to reflect this:


https://cwiki.apache.org/confluence/display/KAFKA/KIP-298%3A+Error+Handling+in+Connect#KIP-298:ErrorHandlinginConnect-ProposedChanges

Thanks,

On Mon, May 21, 2018 at 3:12 PM, Arjun Satish <[email protected]>
wrote:

> Hey Jason,
>
> This KIP does take serialization errors to be retriable. The typical use
> case is that Schema Registry can have a bad/unavailable schema, which can
> be corrected over time.
>
> But since the converters throw DataExceptions for all failures, it is hard
> to determine what caused these errors. Hence, we are going to retry on any
> Exception thrown from a converter.
>
> Hope that works.
>
> Best,
>
>
> On Mon, May 21, 2018 at 2:15 PM, Jason Gustafson <[email protected]>
> wrote:
>
>> Thanks Arjun. I had one additional question. High level, I'm wondering if
>> it makes sense to treat processing errors such as serialization failures
>> the same as IO errors. In the former case, retrying typically doesn't help
>> because the processing is deterministic. In the latter case, the failure
>> may be downstream (e.g. a kafka partition may be temporarily unavailable).
>> As a user, I would probably want the option to skip over processing
>> failures, but retry indefinitely if the downstream system is unavailable.
>> Is that use case supported?
>>
>> Thanks,
>> Jason
>>
>>
>>
>> On Mon, May 21, 2018 at 12:39 PM, Arjun Satish <[email protected]>
>> wrote:
>>
>> > Thanks a lot, Ewen! I'll make sure the documentation is clear on the
>> > differences between retries an tolerance.
>> >
>> > Do you think percentage would have the same problem as the one you
>> brought
>> > up? Also, if we say 10% tolerance, do we have to wait for the duration
>> to
>> > finish before failing the task, or should we fail as soon as we hit 10%
>> > error.
>> >
>> > Alternatively, do you think making tolerance an Enum would be simpler?
>> > Where it's values are NONE (any errors kill), ALL (tolerate all errors
>> and
>> > skip records) and FIRST (tolerate the first error, but fail after that)?
>> >
>> > Best,
>> >
>> >
>> > On Mon, May 21, 2018 at 11:28 AM, Ewen Cheslack-Postava <
>> [email protected]
>> > >
>> > wrote:
>> >
>> > > Arjun,
>> > >
>> > > Understood on retries vs tolerance -- though I suspect this will end
>> up
>> > > being a bit confusing to users as well. It's two levels of error
>> handling
>> > > which is what tripped me up.
>> > >
>> > > One last comment on KIP (which otherwise looks good): for the
>> tolerance
>> > > setting, do we want it to be an absolute value or something like a
>> > > percentage? Given the current way of setting things, I'm not sure I'd
>> > ever
>> > > set it to anything but -1 or 0, with maybe 1 as an easy option for
>> > > restarting a connector to get it past one bad message, then reverting
>> > back
>> > > to -1 or 0.
>> > >
>> > > -Ewen
>> > >
>> > > On Mon, May 21, 2018 at 11:01 AM Arjun Satish <[email protected]
>> >
>> > > wrote:
>> > >
>> > > > Hey Jason,
>> > > >
>> > > > Thanks for your comments. Please find answers inline:
>> > > >
>> > > > On Mon, May 21, 2018 at 9:52 AM, Jason Gustafson <
>> [email protected]>
>> > > > wrote:
>> > > >
>> > > > > Hi Arjun,
>> > > > >
>> > > > > Thanks for the KIP. Just a few comments/questions:
>> > > > >
>> > > > > 1. The proposal allows users to configure the number of retries. I
>> > > > usually
>> > > > > find it easier as a user to work with timeouts since it's
>> difficult
>> > to
>> > > > know
>> > > > > how long a retry might take. Have you considered adding a timeout
>> > > option
>> > > > > which would retry until the timeout expires?
>> > > > >
>> > > >
>> > > > Good point. Updated the KIP.
>> > > >
>> > > > 2. The configs are named very generically (e.g.
>> errors.retries.limit).
>> > Do
>> > > > > you think it will be clear to users what operations these configs
>> > apply
>> > > > to?
>> > > > >
>> > > >
>> > > > As of now, these configs are applicable to all operations in the
>> > > connector
>> > > > pipeline (as mentioned in the proposed changes section). We decided
>> not
>> > > to
>> > > > have per operation limit because of the additional config
>> complexity.
>> > > >
>> > > >
>> > > > > 3. I wasn't entirely clear what messages are stored in the dead
>> > letter
>> > > > > queue. It sounds like it includes both configs and messages since
>> we
>> > > have
>> > > > > errors.dlq.include.configs? Is there a specific schema you have in
>> > > mind?
>> > > > >
>> > > >
>> > > > This has been addressed in the KIP. The DLQ will now contain only
>> raw
>> > > > messages (no additional context). We are also supporting DLQs only
>> for
>> > > sink
>> > > > connectors now.
>> > > >
>> > > >
>> > > > > 4. I didn't see it mentioned explicitly in the KIP, but I assume
>> the
>> > > > > tolerance metrics are reset after every task rebalance?
>> > > > >
>> > > >
>> > > > Great question. Yes, we will reset the tolerance metrics on every
>> > > > rebalance.
>> > > >
>> > > >
>> > > > > 5. I wonder if we can do without errors.tolerance.limit. You can
>> get
>> > a
>> > > > > similar effect using errors.tolerance.rate.limit if you allow
>> longer
>> > > > > durations. I'm not sure how useful an absolute counter is in
>> > practice.
>> > > > >
>> > > >
>> > > > Yeah, the rate limit does subsume the features offered by the
>> absolute
>> > > > counter. Removed it.
>> > > >
>> > > >
>> > > > >
>> > > > > Thanks,
>> > > > > Jason
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: [DISCUSS] KIP-298: Error Handling in Connect

Reply via email to