Thanks Arjun. I had one additional question. High level, I'm wondering if
it makes sense to treat processing errors such as serialization failures
the same as IO errors. In the former case, retrying typically doesn't help
because the processing is deterministic. In the latter case, the failure
may be downstream (e.g. a kafka partition may be temporarily unavailable).
As a user, I would probably want the option to skip over processing
failures, but retry indefinitely if the downstream system is unavailable.
Is that use case supported?

Thanks,
Jason



On Mon, May 21, 2018 at 12:39 PM, Arjun Satish <arjun.sat...@gmail.com>
wrote:

> Thanks a lot, Ewen! I'll make sure the documentation is clear on the
> differences between retries an tolerance.
>
> Do you think percentage would have the same problem as the one you brought
> up? Also, if we say 10% tolerance, do we have to wait for the duration to
> finish before failing the task, or should we fail as soon as we hit 10%
> error.
>
> Alternatively, do you think making tolerance an Enum would be simpler?
> Where it's values are NONE (any errors kill), ALL (tolerate all errors and
> skip records) and FIRST (tolerate the first error, but fail after that)?
>
> Best,
>
>
> On Mon, May 21, 2018 at 11:28 AM, Ewen Cheslack-Postava <e...@confluent.io
> >
> wrote:
>
> > Arjun,
> >
> > Understood on retries vs tolerance -- though I suspect this will end up
> > being a bit confusing to users as well. It's two levels of error handling
> > which is what tripped me up.
> >
> > One last comment on KIP (which otherwise looks good): for the tolerance
> > setting, do we want it to be an absolute value or something like a
> > percentage? Given the current way of setting things, I'm not sure I'd
> ever
> > set it to anything but -1 or 0, with maybe 1 as an easy option for
> > restarting a connector to get it past one bad message, then reverting
> back
> > to -1 or 0.
> >
> > -Ewen
> >
> > On Mon, May 21, 2018 at 11:01 AM Arjun Satish <arjun.sat...@gmail.com>
> > wrote:
> >
> > > Hey Jason,
> > >
> > > Thanks for your comments. Please find answers inline:
> > >
> > > On Mon, May 21, 2018 at 9:52 AM, Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hi Arjun,
> > > >
> > > > Thanks for the KIP. Just a few comments/questions:
> > > >
> > > > 1. The proposal allows users to configure the number of retries. I
> > > usually
> > > > find it easier as a user to work with timeouts since it's difficult
> to
> > > know
> > > > how long a retry might take. Have you considered adding a timeout
> > option
> > > > which would retry until the timeout expires?
> > > >
> > >
> > > Good point. Updated the KIP.
> > >
> > > 2. The configs are named very generically (e.g. errors.retries.limit).
> Do
> > > > you think it will be clear to users what operations these configs
> apply
> > > to?
> > > >
> > >
> > > As of now, these configs are applicable to all operations in the
> > connector
> > > pipeline (as mentioned in the proposed changes section). We decided not
> > to
> > > have per operation limit because of the additional config complexity.
> > >
> > >
> > > > 3. I wasn't entirely clear what messages are stored in the dead
> letter
> > > > queue. It sounds like it includes both configs and messages since we
> > have
> > > > errors.dlq.include.configs? Is there a specific schema you have in
> > mind?
> > > >
> > >
> > > This has been addressed in the KIP. The DLQ will now contain only raw
> > > messages (no additional context). We are also supporting DLQs only for
> > sink
> > > connectors now.
> > >
> > >
> > > > 4. I didn't see it mentioned explicitly in the KIP, but I assume the
> > > > tolerance metrics are reset after every task rebalance?
> > > >
> > >
> > > Great question. Yes, we will reset the tolerance metrics on every
> > > rebalance.
> > >
> > >
> > > > 5. I wonder if we can do without errors.tolerance.limit. You can get
> a
> > > > similar effect using errors.tolerance.rate.limit if you allow longer
> > > > durations. I'm not sure how useful an absolute counter is in
> practice.
> > > >
> > >
> > > Yeah, the rate limit does subsume the features offered by the absolute
> > > counter. Removed it.
> > >
> > >
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > >
> > >
> >
>

Reply via email to