Hey Jason, This KIP does take serialization errors to be retriable. The typical use case is that Schema Registry can have a bad/unavailable schema, which can be corrected over time.
But since the converters throw DataExceptions for all failures, it is hard to determine what caused these errors. Hence, we are going to retry on any Exception thrown from a converter. Hope that works. Best, On Mon, May 21, 2018 at 2:15 PM, Jason Gustafson <ja...@confluent.io> wrote: > Thanks Arjun. I had one additional question. High level, I'm wondering if > it makes sense to treat processing errors such as serialization failures > the same as IO errors. In the former case, retrying typically doesn't help > because the processing is deterministic. In the latter case, the failure > may be downstream (e.g. a kafka partition may be temporarily unavailable). > As a user, I would probably want the option to skip over processing > failures, but retry indefinitely if the downstream system is unavailable. > Is that use case supported? > > Thanks, > Jason > > > > On Mon, May 21, 2018 at 12:39 PM, Arjun Satish <arjun.sat...@gmail.com> > wrote: > > > Thanks a lot, Ewen! I'll make sure the documentation is clear on the > > differences between retries an tolerance. > > > > Do you think percentage would have the same problem as the one you > brought > > up? Also, if we say 10% tolerance, do we have to wait for the duration to > > finish before failing the task, or should we fail as soon as we hit 10% > > error. > > > > Alternatively, do you think making tolerance an Enum would be simpler? > > Where it's values are NONE (any errors kill), ALL (tolerate all errors > and > > skip records) and FIRST (tolerate the first error, but fail after that)? > > > > Best, > > > > > > On Mon, May 21, 2018 at 11:28 AM, Ewen Cheslack-Postava < > e...@confluent.io > > > > > wrote: > > > > > Arjun, > > > > > > Understood on retries vs tolerance -- though I suspect this will end up > > > being a bit confusing to users as well. It's two levels of error > handling > > > which is what tripped me up. > > > > > > One last comment on KIP (which otherwise looks good): for the tolerance > > > setting, do we want it to be an absolute value or something like a > > > percentage? Given the current way of setting things, I'm not sure I'd > > ever > > > set it to anything but -1 or 0, with maybe 1 as an easy option for > > > restarting a connector to get it past one bad message, then reverting > > back > > > to -1 or 0. > > > > > > -Ewen > > > > > > On Mon, May 21, 2018 at 11:01 AM Arjun Satish <arjun.sat...@gmail.com> > > > wrote: > > > > > > > Hey Jason, > > > > > > > > Thanks for your comments. Please find answers inline: > > > > > > > > On Mon, May 21, 2018 at 9:52 AM, Jason Gustafson <ja...@confluent.io > > > > > > wrote: > > > > > > > > > Hi Arjun, > > > > > > > > > > Thanks for the KIP. Just a few comments/questions: > > > > > > > > > > 1. The proposal allows users to configure the number of retries. I > > > > usually > > > > > find it easier as a user to work with timeouts since it's difficult > > to > > > > know > > > > > how long a retry might take. Have you considered adding a timeout > > > option > > > > > which would retry until the timeout expires? > > > > > > > > > > > > > Good point. Updated the KIP. > > > > > > > > 2. The configs are named very generically (e.g. > errors.retries.limit). > > Do > > > > > you think it will be clear to users what operations these configs > > apply > > > > to? > > > > > > > > > > > > > As of now, these configs are applicable to all operations in the > > > connector > > > > pipeline (as mentioned in the proposed changes section). We decided > not > > > to > > > > have per operation limit because of the additional config complexity. > > > > > > > > > > > > > 3. I wasn't entirely clear what messages are stored in the dead > > letter > > > > > queue. It sounds like it includes both configs and messages since > we > > > have > > > > > errors.dlq.include.configs? Is there a specific schema you have in > > > mind? > > > > > > > > > > > > > This has been addressed in the KIP. The DLQ will now contain only raw > > > > messages (no additional context). We are also supporting DLQs only > for > > > sink > > > > connectors now. > > > > > > > > > > > > > 4. I didn't see it mentioned explicitly in the KIP, but I assume > the > > > > > tolerance metrics are reset after every task rebalance? > > > > > > > > > > > > > Great question. Yes, we will reset the tolerance metrics on every > > > > rebalance. > > > > > > > > > > > > > 5. I wonder if we can do without errors.tolerance.limit. You can > get > > a > > > > > similar effect using errors.tolerance.rate.limit if you allow > longer > > > > > durations. I'm not sure how useful an absolute counter is in > > practice. > > > > > > > > > > > > > Yeah, the rate limit does subsume the features offered by the > absolute > > > > counter. Removed it. > > > > > > > > > > > > > > > > > > Thanks, > > > > > Jason > > > > > > > > > > > > > > > > > > > >