Re: [DISCUSS] KIP-298: Error Handling in Connect

Matt Farmer Wed, 16 May 2018 19:14:22 -0700

Hey Arjun,

I like deadletterqueue all lower case, so I'm +1 on that.


Yes, in the case we were seeing there were external system failures.
We had issues connecting to S3. While the connector does include
some retry functionality, however setting these values sufficiently
high seemed to cause us to hit timeouts and cause the entire
task to fail anyway. (I think I was using something like 100 retries
during the brief test of this behavior?)

Yeah, totally understand that there could be unintended concequences
from this. I guess the use case I'm trying to optimize for is to give
folks some bubblegum to keep a high volume system limping
along until the software engineers get time to address it. So I'm
imagining the situation that I'm paged on a Saturday night because of
an intermittent network issue. With a config flag like this I could push
a config change to cause Connect to treat that as retriable and allow
me to wait until the following Monday to make changes to the code.
That may not be a sensible concern for Kafka writ large, but Connect
is a bit weird when compared with Streams or the Clients. It's almost
more of a piece of infrastructure than a library, and I generally like
infrastructure to have escape hatches like that. Just my 0.02 though. :)

Thanks,
Matt

On Tue, May 15, 2018 at 8:46 PM, Arjun Satish <arjun.sat...@gmail.com>
wrote:

> Matt,
>
> Thanks so much for your comments. Really appreciate it!
>
> 1. Good point about the acronym. I can use deadletterqueue instead of dlq
> (using all lowercase to be consistent with the other configs in Kafka).
> What do you think?
>
> 2. Could you please tell us what errors caused these tasks to fail? Were
> they because of external system failures? And if so, could they be
> implemented in the Connector itself? Or using retries with backoffs?
>
> 3. I like this idea. But did not include it here since it might be a
> stretch. One thing to note is that ConnectExceptions can be thrown from a
> variety of places in a connector. I think it should be OK for the Connector
> to throw RetriableException or something that extends it for the operation
> to be retried. By changing this behavior, a lot of existing connectors
> would have to be updated so that they don't rewrite messages into this
> sink. For example, a sink connector might write some data into the external
> system partially, and then fail with a ConnectException. Since the
> framework has no way of knowing what was written and what was not, a retry
> here might cause the same data to written again into the sink.
>
> Best,
>
>
> On Mon, May 14, 2018 at 12:46 PM, Matt Farmer <m...@frmr.me> wrote:
>
> > Hi Arjun,
> >
> > I'm following this very closely as better error handling in Connect is a
> > high priority
> > for MailChimp's Data Systems team.
> >
> > A few thoughts (in no particular order):
> >
> > For the dead letter queue configuration, could we use deadLetterQueue
> > instead of
> > dlq? Acronyms are notoriously hard to keep straight in everyone's head
> and
> > unless
> > there's a compelling reason it would be nice to use the characters and be
> > explicit.
> >
> > Have you considered any behavior that would periodically attempt to
> restart
> > failed
> > tasks after a certain amount of time? To get around our issues internally
> > we've
> > deployed a tool that monitors for failed tasks and restarts the task by
> > hitting the
> > REST API after the failure. Such a config would allow us to get rid of
> this
> > tool.
> >
> > Have you considered a config setting to allow-list additional classes as
> > retryable? In the situation we ran into, we were getting
> ConnectExceptions
> > that
> > were intermittent due to an unrelated service. With such a setting we
> could
> > have
> > deployed a config that temporarily whitelisted that Exception as
> > retry-worthy
> > and continued attempting to make progress while the other team worked
> > on mitigating the problem.
> >
> > Thanks for the KIP!
> >
> > On Wed, May 9, 2018 at 2:59 AM, Arjun Satish <arjun.sat...@gmail.com>
> > wrote:
> >
> > > All,
> > >
> > > I'd like to start a discussion on adding ways to handle and report
> record
> > > processing errors in Connect. Please find a KIP here:
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > 298%3A+Error+Handling+in+Connect
> > >
> > > Any feedback will be highly appreciated.
> > >
> > > Thanks very much,
> > > Arjun
> > >
> >
>

Re: [DISCUSS] KIP-298: Error Handling in Connect

Reply via email to