Thanks, Arjun, for your quick response. Adding an example for the failure log improves things, but I think it'd be better to also add the schema definition of these JSON entries. And I agree with Magesh that this format should be public API.
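To make the discussion concrete, here is a purely hypothetical sketch of what one such entry could look like (none of the field names, stage names, or values below are taken from the KIP; they are invented only to show what a schema would need to pin down):

    {
      "record": {
        "topic": "orders",
        "partition": 3,
        "offset": 4172,
        "key": null,
        "headers": [],
        "value": "<base64-encoded original value>"
      },
      "stage": {
        "type": "VALUE_CONVERTER",
        "class": "org.apache.kafka.connect.json.JsonConverter",
        "config": { "schemas.enable": "false" }
      },
      "exception": "org.apache.kafka.connect.errors.DataException: ..."
    }

Even for a made-up entry like this, a schema would have to say whether 'key' and 'headers' must appear (as null/empty) or may be omitted, and which config belongs to which stage.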
Also, does the current example have a copy/paste typo? It seems that the TRANSFORMATION stage at the end has the config of a converter. Similar to the above, are the fields for 'key' and 'headers' (and their conversion stages) skipped when they are not defined, or should they be present and empty? A schema definition would help a consumer of such logs know what to expect.

Also, thanks for adding some info for errors on the source side. However, I feel the current description might be a little ambiguous. I read: "For errors in a source connector, the process is similar, but care needs to be taken while writing back to the source." and it sounds like it's suggesting that Connect will write records back to the source, which can't be correct.

Finally, a nit: "adds store the row information"... typo?

Thanks,
- Konstantine

On Thu, May 17, 2018 at 12:48 AM, Arjun Satish <arjun.sat...@gmail.com> wrote:

> On Wed, May 16, 2018 at 7:13 PM, Matt Farmer <m...@frmr.me> wrote:
>
> > Hey Arjun,
> >
> > I like deadletterqueue all lower case, so I'm +1 on that.
>
> Super! Updated the KIP.
>
> > Yes, in the case we were seeing there were external system failures.
> > We had issues connecting to S3. While the connector does include
> > some retry functionality, setting these values sufficiently high
> > seemed to cause us to hit timeouts and fail the entire task anyway.
> > (I think I was using something like 100 retries during the brief
> > test of this behavior?)
>
> I am guessing these issues come up when trying to write to S3. Do you
> think the S3 connector can detect the situations where it is safe to
> throw RetriableExceptions instead of ConnectExceptions?
>
> > Yeah, totally understand that there could be unintended consequences
> > from this. I guess the use case I'm trying to optimize for is to give
> > folks some bubblegum to keep a high-volume system limping
> > along until the software engineers get time to address it. So I'm
> > imagining a situation where I'm paged on a Saturday night because of
> > an intermittent network issue. With a config flag like this I could push
> > a config change to cause Connect to treat that as retriable and allow
> > me to wait until the following Monday to make changes to the code.
> > That may not be a sensible concern for Kafka writ large, but Connect
> > is a bit weird when compared with Streams or the Clients. It's almost
> > more of a piece of infrastructure than a library, and I generally like
> > infrastructure to have escape hatches like that. Just my 0.02 though. :)
>
> Haha, yes, it would be good to avoid those Saturday night pagers. Again,
> I am hesitant to imply retries on ConnectExceptions. We could definitely
> define new Exceptions in the Connector, which can be thrown to retry if
> the connector thinks it is safe to do so (see the sketch at the end of
> this thread for what that could look like). We need to keep in mind that
> a retry can be super dangerous in Task.put(List<SinkRecord>). Duplicate
> records can easily creep in, and can be notoriously hard to detect and
> clean up.
>
> > Thanks,
> > Matt
> >
> > On Tue, May 15, 2018 at 8:46 PM, Arjun Satish <arjun.sat...@gmail.com>
> > wrote:
> >
> > > Matt,
> > >
> > > Thanks so much for your comments. Really appreciate it!
> > >
> > > 1. Good point about the acronym. I can use deadletterqueue instead
> > > of dlq (using all lowercase to be consistent with the other configs
> > > in Kafka). What do you think?
> > >
> > > 2. Could you please tell us what errors caused these tasks to fail?
> > > Were they caused by external system failures? And if so, could they
> > > be handled in the Connector itself, or with retries and backoffs?
> > >
> > > 3. I like this idea, but I did not include it here since it might be
> > > a stretch. One thing to note is that ConnectExceptions can be thrown
> > > from a variety of places in a connector. I think it should be OK for
> > > the Connector to throw RetriableException, or something that extends
> > > it, for the operation to be retried. By changing this behavior, a lot
> > > of existing connectors would have to be updated so that they don't
> > > rewrite messages into the sink. For example, a sink connector might
> > > write some data into the external system partially, and then fail
> > > with a ConnectException. Since the framework has no way of knowing
> > > what was written and what was not, a retry here might cause the same
> > > data to be written again to the sink.
> > >
> > > Best,
> > >
> > > On Mon, May 14, 2018 at 12:46 PM, Matt Farmer <m...@frmr.me> wrote:
> > >
> > > > Hi Arjun,
> > > >
> > > > I'm following this very closely, as better error handling in
> > > > Connect is a high priority for MailChimp's Data Systems team.
> > > >
> > > > A few thoughts (in no particular order):
> > > >
> > > > For the dead letter queue configuration, could we use
> > > > deadLetterQueue instead of dlq? Acronyms are notoriously hard to
> > > > keep straight in everyone's head, and unless there's a compelling
> > > > reason it would be nice to use the characters and be explicit.
> > > >
> > > > Have you considered any behavior that would periodically attempt
> > > > to restart failed tasks after a certain amount of time? To get
> > > > around our issues internally we've deployed a tool that monitors
> > > > for failed tasks and restarts them by hitting the REST API after
> > > > the failure. Such a config would allow us to get rid of this tool.
> > > >
> > > > Have you considered a config setting to allow-list additional
> > > > classes as retryable? In the situation we ran into, we were getting
> > > > ConnectExceptions that were intermittent due to an unrelated
> > > > service. With such a setting we could have deployed a config that
> > > > temporarily whitelisted that Exception as retry-worthy and
> > > > continued attempting to make progress while the other team worked
> > > > on mitigating the problem.
> > > >
> > > > Thanks for the KIP!
> > > >
> > > > On Wed, May 9, 2018 at 2:59 AM, Arjun Satish
> > > > <arjun.sat...@gmail.com> wrote:
> > > >
> > > > > All,
> > > > >
> > > > > I'd like to start a discussion on adding ways to handle and
> > > > > report record processing errors in Connect. Please find a KIP
> > > > > here:
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-298%3A+Error+Handling+in+Connect
> > > > >
> > > > > Any feedback will be highly appreciated.
> > > > >
> > > > > Thanks very much,
> > > > > Arjun
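A minimal sketch of the connector-defined retriable exception idea discussed above, assuming a hypothetical sink: TransientStoreException and writeBatch are invented names for illustration, while RetriableException and ConnectException are the existing Connect classes.

    import java.util.Collection;

    import org.apache.kafka.connect.errors.ConnectException;
    import org.apache.kafka.connect.errors.RetriableException;
    import org.apache.kafka.connect.sink.SinkRecord;
    import org.apache.kafka.connect.sink.SinkTask;

    public abstract class ExampleSinkTask extends SinkTask {

        @Override
        public void put(Collection<SinkRecord> records) {
            try {
                // Hypothetical helper that writes to the external system.
                writeBatch(records);
            } catch (TransientStoreException e) {
                // The sink knows nothing was persisted, so redelivering this
                // batch cannot create duplicates: ask the framework to retry.
                throw new RetriableException("Transient failure; batch can be retried", e);
            } catch (RuntimeException e) {
                // The batch may have been partially written; a blind retry
                // could duplicate records, so fail the task instead.
                throw new ConnectException("Unrecoverable failure while writing batch", e);
            }
        }

        // Invented for this sketch: throws TransientStoreException only when
        // it is certain that no data reached the external system.
        protected abstract void writeBatch(Collection<SinkRecord> records)
                throws TransientStoreException;

        // Invented exception type for illustration.
        public static class TransientStoreException extends Exception {
            public TransientStoreException(String message) { super(message); }
        }
    }

The split mirrors the caution raised above: only the connector can know whether a failed write definitely left no data behind, so only it can safely signal that a retry is harmless.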