On Mon, May 21, 2018 at 12:39 PM Arjun Satish <arjun.sat...@gmail.com>
wrote:

> Thanks a lot, Ewen! I'll make sure the documentation is clear on the
> differences between retries an tolerance.
>
> Do you think percentage would have the same problem as the one you brought
> up? Also, if we say 10% tolerance, do we have to wait for the duration to
> finish before failing the task, or should we fail as soon as we hit 10%
> error.
>
Yeah, percent might not be right either. I'd probably only reasonably set
it to 0% or 100% as something in between seems difficult to justify.


>
> Alternatively, do you think making tolerance an Enum would be simpler?
> Where it's values are NONE (any errors kill), ALL (tolerate all errors and
> skip records) and FIRST (tolerate the first error, but fail after that)?
>

I do think the values I would ever use are limited enough to just be an
enum. Not sure if anyone has use cases for larger positive values.

-Ewen


>
> Best,
>
>
> On Mon, May 21, 2018 at 11:28 AM, Ewen Cheslack-Postava <e...@confluent.io
> >
> wrote:
>
> > Arjun,
> >
> > Understood on retries vs tolerance -- though I suspect this will end up
> > being a bit confusing to users as well. It's two levels of error handling
> > which is what tripped me up.
> >
> > One last comment on KIP (which otherwise looks good): for the tolerance
> > setting, do we want it to be an absolute value or something like a
> > percentage? Given the current way of setting things, I'm not sure I'd
> ever
> > set it to anything but -1 or 0, with maybe 1 as an easy option for
> > restarting a connector to get it past one bad message, then reverting
> back
> > to -1 or 0.
> >
> > -Ewen
> >
> > On Mon, May 21, 2018 at 11:01 AM Arjun Satish <arjun.sat...@gmail.com>
> > wrote:
> >
> > > Hey Jason,
> > >
> > > Thanks for your comments. Please find answers inline:
> > >
> > > On Mon, May 21, 2018 at 9:52 AM, Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hi Arjun,
> > > >
> > > > Thanks for the KIP. Just a few comments/questions:
> > > >
> > > > 1. The proposal allows users to configure the number of retries. I
> > > usually
> > > > find it easier as a user to work with timeouts since it's difficult
> to
> > > know
> > > > how long a retry might take. Have you considered adding a timeout
> > option
> > > > which would retry until the timeout expires?
> > > >
> > >
> > > Good point. Updated the KIP.
> > >
> > > 2. The configs are named very generically (e.g. errors.retries.limit).
> Do
> > > > you think it will be clear to users what operations these configs
> apply
> > > to?
> > > >
> > >
> > > As of now, these configs are applicable to all operations in the
> > connector
> > > pipeline (as mentioned in the proposed changes section). We decided not
> > to
> > > have per operation limit because of the additional config complexity.
> > >
> > >
> > > > 3. I wasn't entirely clear what messages are stored in the dead
> letter
> > > > queue. It sounds like it includes both configs and messages since we
> > have
> > > > errors.dlq.include.configs? Is there a specific schema you have in
> > mind?
> > > >
> > >
> > > This has been addressed in the KIP. The DLQ will now contain only raw
> > > messages (no additional context). We are also supporting DLQs only for
> > sink
> > > connectors now.
> > >
> > >
> > > > 4. I didn't see it mentioned explicitly in the KIP, but I assume the
> > > > tolerance metrics are reset after every task rebalance?
> > > >
> > >
> > > Great question. Yes, we will reset the tolerance metrics on every
> > > rebalance.
> > >
> > >
> > > > 5. I wonder if we can do without errors.tolerance.limit. You can get
> a
> > > > similar effect using errors.tolerance.rate.limit if you allow longer
> > > > durations. I'm not sure how useful an absolute counter is in
> practice.
> > > >
> > >
> > > Yeah, the rate limit does subsume the features offered by the absolute
> > > counter. Removed it.
> > >
> > >
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > >
> > >
> >
>

Reply via email to