Matt,

Thanks so much for your comments. Really appreciate it!

1. Good point about the acronym. I can use deadletterqueue instead of dlq
(using all lowercase to be consistent with the other configs in Kafka).
What do you think?

2. Could you please tell us what errors caused these tasks to fail? Were
they caused by external system failures? If so, could they be handled in
the Connector itself? Or by using retries with backoff?

3. I like this idea, but did not include it here since it might be a
stretch. One thing to note is that ConnectExceptions can be thrown from a
variety of places in a connector. I think it should be OK for the Connector
to throw RetriableException or something that extends it for the operation
to be retried. By changing this behavior, a lot of existing connectors
would have to be updated so that they don't rewrite messages into the
sink. For example, a sink connector might write some data into the external
system partially, and then fail with a ConnectException. Since the
framework has no way of knowing what was written and what was not, a retry
here might cause the same data to be written again into the sink.
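
To make the duplicate-write concern concrete, here is a rough, self-contained
sketch. It is not Connect framework code: ConnectException and
RetriableException are redefined locally as stand-ins so the example compiles
without Kafka on the classpath, and SinkWrite, PartialWriteSink,
CleanFailureSink and putWithRetries are hypothetical names made up for
illustration. Backoff between attempts is left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RetrySketch {

    // Local stand-ins for Connect's exception types (assumption: only the
    // "RetriableException extends ConnectException" relationship matters here).
    static class ConnectException extends RuntimeException {
        ConnectException(String message) { super(message); }
    }

    static class RetriableException extends ConnectException {
        RetriableException(String message) { super(message); }
    }

    // Hypothetical abstraction of a sink connector writing a batch.
    interface SinkWrite {
        void put(List<String> batch, List<String> sink);
    }

    // Writes the first record, then fails with a plain ConnectException once.
    static class PartialWriteSink implements SinkWrite {
        private boolean failed = false;

        @Override
        public void put(List<String> batch, List<String> sink) {
            if (!failed) {
                failed = true;
                sink.add(batch.get(0));
                throw new ConnectException("failed after a partial write");
            }
            sink.addAll(batch);
        }
    }

    // Fails once before writing anything and says so via RetriableException.
    static class CleanFailureSink implements SinkWrite {
        private boolean failed = false;

        @Override
        public void put(List<String> batch, List<String> sink) {
            if (!failed) {
                failed = true;
                throw new RetriableException("external system unavailable");
            }
            sink.addAll(batch);
        }
    }

    // Retries RetriableException up to maxAttempts. Plain ConnectExceptions
    // are retried only when retryAllConnectExceptions is true.
    // Returns the number of attempts that were needed.
    static int putWithRetries(SinkWrite write, List<String> batch, List<String> sink,
                              int maxAttempts, boolean retryAllConnectExceptions) {
        for (int attempt = 1; ; attempt++) {
            try {
                write.put(batch, sink);
                return attempt;
            } catch (RetriableException e) {
                if (attempt >= maxAttempts) throw e;
            } catch (ConnectException e) {
                if (!retryAllConnectExceptions || attempt >= maxAttempts) throw e;
            }
        }
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("a", "b");

        // Connector signals a safe retry: the batch lands exactly once.
        List<String> clean = new ArrayList<>();
        putWithRetries(new CleanFailureSink(), batch, clean, 3, false);
        System.out.println(clean); // [a, b]

        // Blindly retrying a plain ConnectException after a partial write
        // puts "a" into the sink twice.
        List<String> duplicated = new ArrayList<>();
        putWithRetries(new PartialWriteSink(), batch, duplicated, 3, true);
        System.out.println(duplicated); // [a, a, b]
    }
}
```

In this sketch the retry loop cannot tell that "a" was already written, which
is why leaving the decision to the connector (by throwing RetriableException
only when a retry is safe) avoids the duplicate.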

Best,


On Mon, May 14, 2018 at 12:46 PM, Matt Farmer <m...@frmr.me> wrote:

> Hi Arjun,
>
> I'm following this very closely as better error handling in Connect is a
> high priority for MailChimp's Data Systems team.
>
> A few thoughts (in no particular order):
>
> For the dead letter queue configuration, could we use deadLetterQueue
> instead of dlq? Acronyms are notoriously hard to keep straight in
> everyone's head and unless there's a compelling reason it would be nice to
> use the characters and be explicit.
>
> Have you considered any behavior that would periodically attempt to restart
> failed tasks after a certain amount of time? To get around our issues
> internally we've deployed a tool that monitors for failed tasks and restarts
> the task by hitting the REST API after the failure. Such a config would
> allow us to get rid of this tool.
>
> Have you considered a config setting to allow-list additional classes as
> retryable? In the situation we ran into, we were getting ConnectExceptions
> that were intermittent due to an unrelated service. With such a setting we
> could have deployed a config that temporarily whitelisted that Exception as
> retry-worthy and continued attempting to make progress while the other team
> worked on mitigating the problem.
>
> Thanks for the KIP!
>
> On Wed, May 9, 2018 at 2:59 AM, Arjun Satish <arjun.sat...@gmail.com> wrote:
>
> > All,
> >
> > I'd like to start a discussion on adding ways to handle and report record
> > processing errors in Connect. Please find a KIP here:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-298%3A+Error+Handling+in+Connect
> >
> > Any feedback will be highly appreciated.
> >
> > Thanks very much,
> > Arjun
> >
>
