Matt,

Thanks so much for your comments. Really appreciate it!

1. Good point about the acronym. I can use deadletterqueue instead of dlq
(using all lowercase to be consistent with the other configs in Kafka).
What do you think?

2. Could you please tell us what errors caused these tasks to fail? Were
they because of external system failures? If so, could handling for them be
implemented in the Connector itself, or with retries and backoff?

3. I like this idea, but did not include it here since it might be a
stretch. One thing to note is that ConnectExceptions can be thrown from a
variety of places in a connector. I think it should be OK for the Connector
to throw RetriableException, or something that extends it, for the operation
to be retried. If we changed this behavior, a lot of existing connectors
would have to be updated so that they don't rewrite messages into the sink.
For example, a sink connector might write some data into the external system
partially, and then fail with a ConnectException. Since the framework has no
way of knowing what was written and what was not, a retry here might cause
the same data to be written again into the sink.

Best,

On Mon, May 14, 2018 at 12:46 PM, Matt Farmer <m...@frmr.me> wrote:

> Hi Arjun,
>
> I'm following this very closely as better error handling in Connect is a
> high priority for MailChimp's Data Systems team.
>
> A few thoughts (in no particular order):
>
> For the dead letter queue configuration, could we use deadLetterQueue
> instead of dlq? Acronyms are notoriously hard to keep straight in
> everyone's head and unless there's a compelling reason it would be nice to
> use the characters and be explicit.
>
> Have you considered any behavior that would periodically attempt to
> restart failed tasks after a certain amount of time? To get around our
> issues internally we've deployed a tool that monitors for failed tasks and
> restarts the task by hitting the REST API after the failure. Such a config
> would allow us to get rid of this tool.
>
> Have you considered a config setting to allow-list additional classes as
> retryable? In the situation we ran into, we were getting ConnectExceptions
> that were intermittent due to an unrelated service. With such a setting we
> could have deployed a config that temporarily whitelisted that Exception
> as retry-worthy and continued attempting to make progress while the other
> team worked on mitigating the problem.
>
> Thanks for the KIP!
>
> On Wed, May 9, 2018 at 2:59 AM, Arjun Satish <arjun.sat...@gmail.com>
> wrote:
>
> > All,
> >
> > I'd like to start a discussion on adding ways to handle and report record
> > processing errors in Connect. Please find a KIP here:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-298%3A+Error+Handling+in+Connect
> >
> > Any feedback will be highly appreciated.
> >
> > Thanks very much,
> > Arjun
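
To make the retry concern in point 3 of the reply concrete, here is a minimal Java sketch. It is only an illustration under stated assumptions: the two exception classes are local stand-ins that mirror the Connect hierarchy (where RetriableException extends ConnectException), and putWithRetries is a hypothetical helper, not a Connect API. Only RetriableException is retried; a plain ConnectException propagates, because the caller cannot tell how much of the batch already reached the sink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class RetrySketch {
    // Local stand-ins mirroring the Connect exception hierarchy (assumption:
    // real code would use the classes from org.apache.kafka.connect.errors).
    static class ConnectException extends RuntimeException {
        ConnectException(String msg) { super(msg); }
    }
    static class RetriableException extends ConnectException {
        RetriableException(String msg) { super(msg); }
    }

    // Hypothetical helper: calls put until it succeeds or maxAttempts (>= 1)
    // is reached. Returns the number of attempts used. Backoff is omitted.
    static <T> int putWithRetries(List<T> batch, Consumer<List<T>> put, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                put.accept(batch);
                return attempt;
            } catch (RetriableException e) {
                // The connector signalled that a retry is safe.
                if (attempt >= maxAttempts) {
                    throw e;
                }
            }
            // Any other ConnectException is not caught and propagates:
            // part of the batch may already be in the sink.
        }
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        int[] calls = {0};

        // Fails once before writing anything, then succeeds.
        int attempts = putWithRetries(List.of("a", "b"), b -> {
            if (calls[0]++ == 0) {
                throw new RetriableException("transient failure");
            }
            sink.addAll(b);
        }, 3);
        System.out.println("attempts=" + attempts + " sink=" + sink);

        // Writes one record, then fails; retrying would duplicate "c".
        try {
            putWithRetries(List.of("c", "d"), b -> {
                sink.add(b.get(0));
                throw new ConnectException("failed after partial write");
            }, 3);
        } catch (ConnectException e) {
            System.out.println("not retried: " + e.getMessage() + " sink=" + sink);
        }
    }
}
```

If the second case were retried blindly, "c" would be written to the sink once per attempt, which is the duplicate-write scenario described in point 3.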