Hi Zihan, In my experience, direct analysis of a record isn't often necessary when diagnosing issues with conversion or transformations. I raised the idea of log messages because in most cases they've been completely sufficient, paired with the source code for the converter or the transformation, to help me diagnose issues with records in Connect.
What I'm trying to understand here is what the cases are where directly examining a record is necessary, and how, in those cases, a console consumer might not be sufficient. What kinds of errors are you thinking of that might be difficult to debug with the current setup, and would benefit from analysis in an external system? Cheers, Chris On Wed, May 6, 2020 at 12:29 PM Zihan Li <lizi...@umich.edu> wrote: > Hi Chris, > > Thanks a lot for the reply. > > To make error messages easy to find, I think we can add messages to the > header of a broken record, just like the mechanism in DLQ. This way even > if the broken record is stored in external systems, error messages and > broken records are paired. > > Currently, after a broken record is sent to the DLQ, there are usually two > ways > to analyze it. One method is to use some consumer tool to examine the > messages directly. The other is to consume the DLQ again into an external > system for analysis. This proposal would help in the second case by > eliminating the DLQ sink connector. In the first case, most open-sourced > consumer tools are not as powerful as external tools in terms of querying, > aggregating, and pattern-finding on byte messages. Confluent KSQL is a > powerful consumer tool, but it is not part of the open-sourced project. > Therefore I think the proposal would help in the first case by not only > flattening the learning curve of consumer tools, but also enabling external tools > for > analysis. > > Best, > Zihan > > On 2020/05/03 17:36:34, Christopher Egerton <chr...@confluent.io> wrote: > > Hi Zihan, > > > > I guess I'm still unclear on exactly what form this analysis might take. > If > > a converter has an issue (de)serializing a record, for example, the > first > > thing I check out is the stack trace in the worker logs that tells me > what > > went wrong and where. The same goes for errors thrown during > > transformation. 
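Zihan's suggestion above of pairing error messages with the broken record via headers mirrors what Connect's DLQ route already does when `errors.deadletterqueue.context.headers.enable=true`. A minimal sketch of that pairing, with header names modeled on Connect's `__connect.errors.*` convention and a plain `Map` standing in for real Kafka record headers (not the actual Connect implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: attach error context to a broken record via headers, the way the
// DLQ reporter does when errors.deadletterqueue.context.headers.enable=true.
// Header names follow Connect's "__connect.errors.*" convention; the Map is
// a simplified stand-in for real Kafka record headers.
public class ErrorContextHeaders {

    public static Map<String, String> forFailure(String topic, int partition,
                                                 long offset, Throwable cause) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("__connect.errors.topic", topic);
        headers.put("__connect.errors.partition", Integer.toString(partition));
        headers.put("__connect.errors.offset", Long.toString(offset));
        headers.put("__connect.errors.exception.class.name", cause.getClass().getName());
        headers.put("__connect.errors.exception.message", String.valueOf(cause.getMessage()));
        return headers;
    }

    public static void main(String[] args) {
        // Hypothetical failure: the raw record bytes travel in the message body,
        // while the diagnosis travels alongside them in the headers.
        Map<String, String> headers = forFailure("orders", 0, 42L,
                new IllegalArgumentException("Unknown magic byte"));
        headers.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

With headers like these, a broken record that lands in an external system keeps its error message attached, whichever sink it flowed through.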
Can we have some concrete examples of what kind of > analysis > > performed on byte arrays in external systems might be more informative, > > especially when it would either be performed without easy-to-find log > > messages or require extra effort to make those log messages easy to find > > and associate with the bytes in the external system? > > > > Cheers, > > > > Chris > > > > On Thu, Apr 30, 2020 at 1:01 PM Zihan Li <lizi...@umich.edu> wrote: > > > > > Hi Chris and Andrew, > > > > > > Thanks a lot for your reply! > > > > > > I think in most cases it is easier to analyze broken records in an > > > external > > > system rather than in a Kafka DLQ topic. While it might be possible to > > > directly analyze broken records with Kafka, people are generally more > > > familiar with external tools, such as file systems and relational > > > databases. > > > Exporting broken records to those external systems would enable many > more > > > analysis tools. Users can use those tools to audit end-to-end data flow > > > and > > > work with upstream teams to improve data quality. As a result, in many > > > cases, the DLQ is consumed again by an additional connector for further > > > analysis. > > > So as Chris has mentioned, the point of this KIP is to save users the > > > extra > > > time and effort to maintain and tune this additional DLQ sink connector. > > > > > > The expected behavior of this new error handling option should be > > > consistent > > > with the DLQ. Namely, if any of the key, value, or headers is broken, the record > > > should be sent to SinkTask.putBrokenRecord() instead of the DLQ. > > > > > > Best, > > > Zihan > > > > > > On 2020/04/25 20:05:37, Christopher Egerton <chr...@confluent.io> > wrote: > > > > Hi Zihan, > > > > > > > > Thanks for the changes and the clarifications! 
I agree that maintaining a second topic > > > > and a second connector is a fair > amount of > > > > work; to Andrew's question, it seems less about the cost of just > running > > > > another connector, and more about managing that second connector (and > > > > topic) when a lot of the logic is identical, such as topic ACLs, > > > > credentials for the connector to access the external system, and > other > > > > fine-tuning. > > > > > > > > However, I'm still curious about the general use case here. For > example, > > > if > > > > a converter fails to deserialize a record, it seems like the right > thing > > > to > > > > do would be to examine the record, try to understand why it's > failing, > > > and > > > > then find a converter that can handle it. If the raw byte array for > the > > > > Kafka message gets written to the external system instead, what's the > > > > benefit to the user? Yes, they won't have to configure another > connector > > > > and manage another topic, but they're still going to want to examine > that > > > > data at some point; why would it be easier to deal with malformed > records > > > > from an external system than it would from where they originally > broke, > > > in > > > > Kafka? > > > > > > > > If we're going to add a new feature like this to the framework, I > just > > > want > > > > to make sure that there's a general use case for this that isn't > tied to > > > > one specific type of connector, external system, usage pattern, etc. > > > > > > > > Oh, and one other question that came to mind--what would the expected > > > > behavior be if a converter was unable to deserialize a record's key, > but > > > > was able to deserialize its value? > > > > > > > > Cheers, > > > > > > > > Chris > > > > > > > > On Sat, Apr 25, 2020 at 12:27 PM Andrew Schofield < > > > andrew_schofi...@live.com> > > > > wrote: > > > > > > > > > Hi Zihan, > > > > > Thanks for the KIP. I have a question about the proposal. 
> > > > > > > > > > Why do you think putting a broken record somewhere other than a > > > > > dead-letter topic is > > > > > better? For a reliable system, you really want zero broken > records, or > > > > > perhaps close to zero. > > > > > Broken records represent exceptions that need to be alerted on and > dealt > > > > > with, either by > > > > > fixing an incorrect application, or improving the quality of the > incoming > > > > > data. I wouldn't > > > > > imagine a second set of connectors reading dead-letter topics > storing > > > the > > > > > broken events > > > > > elsewhere. > > > > > > > > > > If you did want to store them in S3, HDFS or wherever, why > couldn't you > > > > > run another > > > > > connector off the dead-letter topic, with the ByteArrayConverter, > that > > > > > just bundles up > > > > > the broken records as raw bytes? This seems to me very close to > what > > > this > > > > > KIP is trying to > > > > > achieve, only without needing any interface or behaviour changes > in the > > > > > connectors. Yes, > > > > > you need to run more connectors, but in a distributed Connect > cluster, > > > > > that's easy to achieve. > > > > > > > > > > Thanks, > > > > > Andrew Schofield > > > > > IBM > > > > > > > > > > On 24/04/2020, 22:00, "Zihan Li" <lizi...@umich.edu> wrote: > > > > > > > > > > Hi Chris, > > > > > > > > > > Thanks a lot for your comments. > > > > > > > > > > 1. The complexity comes from maintaining an additional topic > and a > > > > > connector, rather than configuring them. Users need to spend extra > > > time and > > > > > money to maintain the additional connectors. I can imagine a case > > > where a > > > > > user has 3 topics consumed by S3, HDFS and JDBC respectively. The > user > > > has > > > > > to maintain 3 more connectors to consume three DLQs, in order to > put > > > broken > > > > > records in the places they should go. 
This new option will give > users a > > > > > choice to only maintain half of their connectors, while still having broken > > > records > > > > > stored in each destination system. > > > > > > > > > > 2. This is a great question. I updated my KIP to reflect the > most > > > > > recent plan. We can add a new method to SinkTask called > > > “putBrokenRecord”, > > > > > so that sink connectors are able to differentiate between > well-formed > > > > > records and broken records. The default implementation of this > method > > > > > should throw an error to indicate that the connector does not > > > support > > > > > broken record handling yet. > > > > > > > > > > 3. I think the Schema should be Optional Byte Array, in order > to > > > > > handle all possibilities. But I’m open to suggestions on that. > > > > > > > > > > 4. Yes, this rejected alternative plan makes sense to me. I’ll > put > > > > > that into the KIP. Compared with this alternative, the point of > this > > > > > proposal is to save the effort of maintaining twice as many > connectors as > > > > > necessary. > > > > > > > > > > Thanks again. Looking forward to the discussion! > > > > > > > > > > Sorry if you see this email twice, the previous one didn't > show up > > > on > > > > > this discussion thread. > > > > > > > > > > Best, > > > > > Zihan > > > > > > > > > > On 2020/04/13 22:35:56, Christopher Egerton < > chr...@confluent.io> > > > > > wrote: > > > > > > Hi Zihan, > > > > > > > > > > > > Thanks for the KIP! I have some questions that I'm hoping we > can > > > > > address to > > > > > > help better understand the motivation for this proposal. > > > > > > > > > > > > 1. In the "Motivation" section it's written that "If users > want > > > to > > > > > store > > > > > > their broken records, they have to config a broken record > queue, > > > > > which is > > > > > > too much work for them in some cases." Could you elaborate on > > > what > > > > > makes > > > > > > this a lot of work? 
Ideally, users should be able to > configure > > > the > > > > > dead > > > > > > letter queue by specifying a value for the " > > > > > > errors.deadletterqueue.topic.name" property in their sink > > > connector > > > > > config; > > > > > > this doesn't seem like a lot of work on the surface. > > > > > > > > > > > > 2. If the "errors.tolerance" property is set to "continue", > would > > > > > sink > > > > > > connectors be able to differentiate between well-formed > records > > > whose > > > > > > successfully-deserialized contents are byte arrays and > malformed > > > > > records > > > > > > whose contents are the still-serialized byte arrays of the > Kafka > > > > > message > > > > > > from which they came? > > > > > > > > > > > > 3. I think it's somewhat implied by the KIP, but it'd be > nice to > > > see > > > > > what > > > > > > the schema for a malformed record would be. Null? Byte array? > > > > > Optional byte > > > > > > array? > > > > > > > > > > > > 4. This is somewhat covered by the first question, but it > seems > > > worth > > > > > > pointing out that this exact functionality can already be > > > achieved > > > > > by using > > > > > > features already provided by the framework. Configure your > > > connector > > > > > to > > > > > > send malformed records to a dead letter queue topic, and > > > configure a > > > > > > separate connector to consume from that dead letter queue > topic, > > > use > > > > > the > > > > > > ByteArrayConverter to deserialize records, and send those > > > records to > > > > > the > > > > > > destination sink. It'd be nice if this were called out in the > > > > > "Rejected > > > > > > Alternatives" section with a reason on why the changes > proposed > > > in > > > > > the KIP > > > > > > are preferable, especially since it may still work as a > viable > > > > > workaround > > > > > > for users who are working on older versions of the Connect > > > framework. > > > > > > > > > > > > Looking forward to the discussion! 
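The workaround in item 4 above takes two connector configs: the primary sink routes failures to a DLQ topic, and a second sink drains that topic as raw bytes via the ByteArrayConverter. A sketch of the two configs as Java maps — the `errors.*` and `*.converter` property names are Connect's real ones, while the connector class and topic names are hypothetical placeholders:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the two-connector DLQ workaround. Property keys are real Kafka
// Connect sink properties; connector classes and topic names are invented
// for illustration.
public class DlqWorkaroundConfigs {

    // Primary sink: tolerate bad records and route them to a DLQ topic.
    public static Map<String, String> primarySink() {
        Map<String, String> config = new LinkedHashMap<>();
        config.put("connector.class", "com.example.ExampleSinkConnector"); // hypothetical
        config.put("topics", "orders");                                    // hypothetical
        config.put("errors.tolerance", "all");
        config.put("errors.deadletterqueue.topic.name", "orders-dlq");
        config.put("errors.deadletterqueue.context.headers.enable", "true");
        return config;
    }

    // Second sink: drain the DLQ as raw bytes into the same external system.
    public static Map<String, String> dlqSink() {
        Map<String, String> config = new LinkedHashMap<>();
        config.put("connector.class", "com.example.ExampleSinkConnector"); // hypothetical
        config.put("topics", "orders-dlq");
        config.put("key.converter", "org.apache.kafka.connect.converters.ByteArrayConverter");
        config.put("value.converter", "org.apache.kafka.connect.converters.ByteArrayConverter");
        return config;
    }
}
```

This is the extra connector-and-topic pair the KIP argues users should not have to manage; everything except the `topics` and converter settings is duplicated between the two configs.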
> > > > > > > > > > > > Cheers, > > > > > > > > > > > > Chris > > > > > > > > > > > > On Tue, Mar 24, 2020 at 11:50 AM Zihan Li <lizi...@umich.edu > > > > > wrote: > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > I just want to re-up this discussion thread about KIP-582 > Add a > > > > > "continue" > > > > > > > option for Kafka Connect error handling. > > > > > > > > > > > > > > Wiki page: https://cwiki.apache.org/confluence/x/XRvcC < > > > > > > > https://cwiki.apache.org/confluence/x/XRvcC> > > > > > > > > > > > > > > JIRA: https://issues.apache.org/jira/browse/KAFKA-9740 < > > > > > > > https://issues.apache.org/jira/browse/KAFKA-9740> > > > > > > > > > > > > > > Please share your thoughts about adding this new error > handling > > > > > option to > > > > > > > Kafka Connect. > > > > > > > > > > > > > > Best, > > > > > > > Zihan > > > > > > > > > > > > > > > On Mar 18, 2020, at 12:55 PM, Zihan Li < > lizi...@umich.edu> > > > > > wrote: > > > > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > I'd like to use this thread to discuss KIP-582 Add a > > > "continue" > > > > > option > > > > > > > for Kafka Connect error handling, please see detail at: > > > > > > > > https://cwiki.apache.org/confluence/x/XRvcC < > > > > > > > https://cwiki.apache.org/confluence/x/XRvcC> > > > > > > > > > > > > > > > > Best, > > > > > > > > Zihan Li > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
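For reference, the SinkTask.putBrokenRecord() hook proposed earlier in this thread might look roughly as follows. These are simplified stand-in types, not the real org.apache.kafka.connect.sink API, and putBrokenRecord is only a proposal from this discussion, not a shipped method:

```java
import java.util.Collection;

// Sketch of the proposed SinkTask change. BrokenSinkRecord and this class
// are simplified stand-ins for the real Connect API; putBrokenRecord() is
// the method proposed in this thread, not an existing one.
public abstract class SinkTaskSketch {

    /** A record whose key/value/headers failed conversion; raw bytes are preserved. */
    public static final class BrokenSinkRecord {
        public final byte[] rawKey;
        public final byte[] rawValue;

        public BrokenSinkRecord(byte[] rawKey, byte[] rawValue) {
            this.rawKey = rawKey;
            this.rawValue = rawValue;
        }
    }

    /** Well-formed records, delivered as today. */
    public abstract void put(Collection<Object> records);

    /**
     * Proposed hook for records that failed conversion. The default throws,
     * so connectors that have not opted in fail fast rather than silently
     * dropping broken records.
     */
    public void putBrokenRecord(Collection<BrokenSinkRecord> records) {
        throw new UnsupportedOperationException(
                "This connector does not support broken record handling");
    }
}
```

A connector that wants the new "continue" behavior would override putBrokenRecord and write the raw bytes to its destination system; all others keep today's semantics.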