After some discussions with Jason, we decided that the error handler should retry an operation only it throws a RetriableException. At the same time, any Exception in the Transformation and Converter step can be tolerated (by skipping the record). I updated the table in the Proposed Changes section to reflect this:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-298%3A+Error+Handling+in+Connect#KIP-298:ErrorHandlinginConnect-ProposedChanges Thanks, On Mon, May 21, 2018 at 3:12 PM, Arjun Satish <arjun.sat...@gmail.com> wrote: > Hey Jason, > > This KIP does take serialization errors to be retriable. The typical use > case is that Schema Registry can have a bad/unavailable schema, which can > be corrected over time. > > But since the converters throw DataExceptions for all failures, it is hard > to determine what caused these errors. Hence, we are going to retry on any > Exception thrown from a converter. > > Hope that works. > > Best, > > > On Mon, May 21, 2018 at 2:15 PM, Jason Gustafson <ja...@confluent.io> > wrote: > >> Thanks Arjun. I had one additional question. High level, I'm wondering if >> it makes sense to treat processing errors such as serialization failures >> the same as IO errors. In the former case, retrying typically doesn't help >> because the processing is deterministic. In the latter case, the failure >> may be downstream (e.g. a kafka partition may be temporarily unavailable). >> As a user, I would probably want the option to skip over processing >> failures, but retry indefinitely if the downstream system is unavailable. >> Is that use case supported? >> >> Thanks, >> Jason >> >> >> >> On Mon, May 21, 2018 at 12:39 PM, Arjun Satish <arjun.sat...@gmail.com> >> wrote: >> >> > Thanks a lot, Ewen! I'll make sure the documentation is clear on the >> > differences between retries an tolerance. >> > >> > Do you think percentage would have the same problem as the one you >> brought >> > up? Also, if we say 10% tolerance, do we have to wait for the duration >> to >> > finish before failing the task, or should we fail as soon as we hit 10% >> > error. >> > >> > Alternatively, do you think making tolerance an Enum would be simpler? >> > Where it's values are NONE (any errors kill), ALL (tolerate all errors >> and >> > skip records) and FIRST (tolerate the first error, but fail after that)? >> > >> > Best, >> > >> > >> > On Mon, May 21, 2018 at 11:28 AM, Ewen Cheslack-Postava < >> e...@confluent.io >> > > >> > wrote: >> > >> > > Arjun, >> > > >> > > Understood on retries vs tolerance -- though I suspect this will end >> up >> > > being a bit confusing to users as well. It's two levels of error >> handling >> > > which is what tripped me up. >> > > >> > > One last comment on KIP (which otherwise looks good): for the >> tolerance >> > > setting, do we want it to be an absolute value or something like a >> > > percentage? Given the current way of setting things, I'm not sure I'd >> > ever >> > > set it to anything but -1 or 0, with maybe 1 as an easy option for >> > > restarting a connector to get it past one bad message, then reverting >> > back >> > > to -1 or 0. >> > > >> > > -Ewen >> > > >> > > On Mon, May 21, 2018 at 11:01 AM Arjun Satish <arjun.sat...@gmail.com >> > >> > > wrote: >> > > >> > > > Hey Jason, >> > > > >> > > > Thanks for your comments. Please find answers inline: >> > > > >> > > > On Mon, May 21, 2018 at 9:52 AM, Jason Gustafson < >> ja...@confluent.io> >> > > > wrote: >> > > > >> > > > > Hi Arjun, >> > > > > >> > > > > Thanks for the KIP. Just a few comments/questions: >> > > > > >> > > > > 1. The proposal allows users to configure the number of retries. I >> > > > usually >> > > > > find it easier as a user to work with timeouts since it's >> difficult >> > to >> > > > know >> > > > > how long a retry might take. Have you considered adding a timeout >> > > option >> > > > > which would retry until the timeout expires? >> > > > > >> > > > >> > > > Good point. Updated the KIP. >> > > > >> > > > 2. The configs are named very generically (e.g. >> errors.retries.limit). >> > Do >> > > > > you think it will be clear to users what operations these configs >> > apply >> > > > to? >> > > > > >> > > > >> > > > As of now, these configs are applicable to all operations in the >> > > connector >> > > > pipeline (as mentioned in the proposed changes section). We decided >> not >> > > to >> > > > have per operation limit because of the additional config >> complexity. >> > > > >> > > > >> > > > > 3. I wasn't entirely clear what messages are stored in the dead >> > letter >> > > > > queue. It sounds like it includes both configs and messages since >> we >> > > have >> > > > > errors.dlq.include.configs? Is there a specific schema you have in >> > > mind? >> > > > > >> > > > >> > > > This has been addressed in the KIP. The DLQ will now contain only >> raw >> > > > messages (no additional context). We are also supporting DLQs only >> for >> > > sink >> > > > connectors now. >> > > > >> > > > >> > > > > 4. I didn't see it mentioned explicitly in the KIP, but I assume >> the >> > > > > tolerance metrics are reset after every task rebalance? >> > > > > >> > > > >> > > > Great question. Yes, we will reset the tolerance metrics on every >> > > > rebalance. >> > > > >> > > > >> > > > > 5. I wonder if we can do without errors.tolerance.limit. You can >> get >> > a >> > > > > similar effect using errors.tolerance.rate.limit if you allow >> longer >> > > > > durations. I'm not sure how useful an absolute counter is in >> > practice. >> > > > > >> > > > >> > > > Yeah, the rate limit does subsume the features offered by the >> absolute >> > > > counter. Removed it. >> > > > >> > > > >> > > > > >> > > > > Thanks, >> > > > > Jason >> > > > > >> > > > > >> > > > >> > > >> > >> > >