On Mon, May 21, 2018 at 12:39 PM Arjun Satish <arjun.sat...@gmail.com> wrote:
> Thanks a lot, Ewen! I'll make sure the documentation is clear on the > differences between retries an tolerance. > > Do you think percentage would have the same problem as the one you brought > up? Also, if we say 10% tolerance, do we have to wait for the duration to > finish before failing the task, or should we fail as soon as we hit 10% > error. > Yeah, percent might not be right either. I'd probably only reasonably set it to 0% or 100% as something in between seems difficult to justify. > > Alternatively, do you think making tolerance an Enum would be simpler? > Where it's values are NONE (any errors kill), ALL (tolerate all errors and > skip records) and FIRST (tolerate the first error, but fail after that)? > I do think the values I would ever use are limited enough to just be an enum. Not sure if anyone has use cases for larger positive values. -Ewen > > Best, > > > On Mon, May 21, 2018 at 11:28 AM, Ewen Cheslack-Postava <e...@confluent.io > > > wrote: > > > Arjun, > > > > Understood on retries vs tolerance -- though I suspect this will end up > > being a bit confusing to users as well. It's two levels of error handling > > which is what tripped me up. > > > > One last comment on KIP (which otherwise looks good): for the tolerance > > setting, do we want it to be an absolute value or something like a > > percentage? Given the current way of setting things, I'm not sure I'd > ever > > set it to anything but -1 or 0, with maybe 1 as an easy option for > > restarting a connector to get it past one bad message, then reverting > back > > to -1 or 0. > > > > -Ewen > > > > On Mon, May 21, 2018 at 11:01 AM Arjun Satish <arjun.sat...@gmail.com> > > wrote: > > > > > Hey Jason, > > > > > > Thanks for your comments. Please find answers inline: > > > > > > On Mon, May 21, 2018 at 9:52 AM, Jason Gustafson <ja...@confluent.io> > > > wrote: > > > > > > > Hi Arjun, > > > > > > > > Thanks for the KIP. Just a few comments/questions: > > > > > > > > 1. The proposal allows users to configure the number of retries. I > > > usually > > > > find it easier as a user to work with timeouts since it's difficult > to > > > know > > > > how long a retry might take. Have you considered adding a timeout > > option > > > > which would retry until the timeout expires? > > > > > > > > > > Good point. Updated the KIP. > > > > > > 2. The configs are named very generically (e.g. errors.retries.limit). > Do > > > > you think it will be clear to users what operations these configs > apply > > > to? > > > > > > > > > > As of now, these configs are applicable to all operations in the > > connector > > > pipeline (as mentioned in the proposed changes section). We decided not > > to > > > have per operation limit because of the additional config complexity. > > > > > > > > > > 3. I wasn't entirely clear what messages are stored in the dead > letter > > > > queue. It sounds like it includes both configs and messages since we > > have > > > > errors.dlq.include.configs? Is there a specific schema you have in > > mind? > > > > > > > > > > This has been addressed in the KIP. The DLQ will now contain only raw > > > messages (no additional context). We are also supporting DLQs only for > > sink > > > connectors now. > > > > > > > > > > 4. I didn't see it mentioned explicitly in the KIP, but I assume the > > > > tolerance metrics are reset after every task rebalance? > > > > > > > > > > Great question. Yes, we will reset the tolerance metrics on every > > > rebalance. > > > > > > > > > > 5. I wonder if we can do without errors.tolerance.limit. You can get > a > > > > similar effect using errors.tolerance.rate.limit if you allow longer > > > > durations. I'm not sure how useful an absolute counter is in > practice. > > > > > > > > > > Yeah, the rate limit does subsume the features offered by the absolute > > > counter. Removed it. > > > > > > > > > > > > > > Thanks, > > > > Jason > > > > > > > > > > > > > >