Hi, Andrew.

The Converter is part of Connect's public API, so it certainly is
valid/encouraged to create new implementations when that makes sense for
users. I know of several Converter implementations that are outside of the
Apache Kafka project. The project's existing JSON converter is fairly
limited with respect to schemas: when configured with its default settings
it serializes Connect's own schema representation rather than using another
schema language (in part because there is no truly standard schema language
for JSON, and in part because that's all that's necessary). This is
convenient but also
results in very verbose messages, since the schema is included in every
message. Of course, you can disable the JSON converter's use of schemas
altogether, at which point the converter simply serializes/deserializes
simple JSON documents, arrays, and literals.
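
If it helps, here is a minimal sketch of what that looks like with the
stock JsonConverter and schemas disabled (the topic name and payload are
made up for illustration):

    import java.nio.charset.StandardCharsets;
    import java.util.Collections;
    import org.apache.kafka.connect.data.SchemaAndValue;
    import org.apache.kafka.connect.json.JsonConverter;

    public class PlainJsonExample {
        public static void main(String[] args) {
            // With "schemas.enable" set to false the converter reads/writes
            // plain JSON; with the default (true) it expects/produces the
            // {"schema": ..., "payload": ...} envelope instead.
            JsonConverter converter = new JsonConverter();
            converter.configure(
                Collections.singletonMap("schemas.enable", "false"),
                /* isKey */ false);

            byte[] raw = "{\"page\":\"Main_Page\",\"views\":42}"
                .getBytes(StandardCharsets.UTF_8);
            SchemaAndValue parsed = converter.toConnectData("my-topic", raw);

            // Because no schema accompanied the message, parsed.schema() is
            // null and parsed.value() is just a java.util.Map of the fields.
            System.out.println(parsed.schema() + " / " + parsed.value());
        }
    }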

So it certainly makes sense to implement a Converter when you want or need
to use other representations. The whole raison d'être of Connect's
Converter interface is precisely that the Converter -- and not the
connectors -- is solely responsible for serialization and deserialization.
Simply plug in a new Converter, and you can still use any and all of the
connectors that are properly implemented. Just be aware that you may also
want to implement Kafka's Serializer and Deserializer interfaces, which
ordinary JVM clients require. A common approach is to implement those, and
then to have the Converter implementation simply reuse them.
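
To make that last point concrete, here is a rough skeleton of the
delegation pattern. The class name (DelegatingConverter) and the abstract
methods are hypothetical placeholders for whatever format you implement,
not anything that exists today:

    import java.util.Map;
    import org.apache.kafka.common.serialization.Deserializer;
    import org.apache.kafka.common.serialization.Serializer;
    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.data.SchemaAndValue;
    import org.apache.kafka.connect.storage.Converter;

    // Sketch of a Converter that delegates the byte-level work to a
    // Serializer/Deserializer pair so plain Kafka clients can share them.
    public abstract class DelegatingConverter implements Converter {

        private Serializer<Object> serializer;
        private Deserializer<Object> deserializer;

        // Subclasses supply the format-specific pieces: the same
        // Serializer/Deserializer you'd hand to an ordinary producer or
        // consumer, plus the translation to and from Connect's data model.
        protected abstract Serializer<Object> newSerializer();
        protected abstract Deserializer<Object> newDeserializer();
        protected abstract Object toNative(Schema schema, Object value);
        protected abstract SchemaAndValue toConnect(Object nativeValue);

        @Override
        public void configure(Map<String, ?> configs, boolean isKey) {
            serializer = newSerializer();
            deserializer = newDeserializer();
            serializer.configure(configs, isKey);
            deserializer.configure(configs, isKey);
        }

        @Override
        public byte[] fromConnectData(String topic, Schema schema, Object value) {
            // Translate the Connect schema/value into the format's own
            // representation, then let the shared Serializer produce the bytes.
            return serializer.serialize(topic, toNative(schema, value));
        }

        @Override
        public SchemaAndValue toConnectData(String topic, byte[] value) {
            // Let the shared Deserializer parse the bytes, then map the
            // result back into a Connect Schema and value.
            return toConnect(deserializer.deserialize(topic, value));
        }
    }

With that split, a plain producer or consumer can use the
Serializer/Deserializer directly, while Connect uses the Converter, and
both agree on the byte format.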

When implementing a Converter that deals with schemas, consider the
following questions. The answers will likely affect whether or how easily
others can reuse that Converter in their own environments.

1) Where will the schema be persisted? Within every record? Externally and
somehow only referenced from each message? Or only used to validate before
serialization? The JSON converter uses the first approach, but that has
significant disadvantages, including message size and the performance
degradation due to repeatedly serializing and deserializing the same
schema. Confluent's Schema Registry (which as you point out currently
supports only Avro schemas) takes the second approach by storing the schema
in a centralized service; the AvroConverter then includes only a small
schema identifier in every message rather than the whole schema, allowing
the deserializer to know exactly which schema was used during
serialization. (This takes advantage of Avro's ability to deserialize
using a different but compatible schema than the one used by the
serializer.) See the sketch after this list for what that identifier-based
framing might look like.
2) How will the converter handle schema evolution? Will it constrain how
producers and consumers (whether or not those are connectors) can be
upgraded?
3) How will the converter handle schema variation? Is it possible to use
the converter on message streams that are a mixture of a finite number of
different schemas?
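
To make the first question a bit more concrete, here is a rough sketch of
the "reference the schema by a small identifier" framing mentioned in 1).
The exact layout (a magic byte plus a 4-byte id) just illustrates the
general idea and is not meant as a drop-in for any particular registry's
wire format:

    import java.nio.ByteBuffer;

    // Sketch of the "schema stored externally, referenced by id" approach:
    // each message carries a small fixed-size header identifying the schema
    // used to write it, while the full schema lives in a central service.
    public class SchemaIdFraming {

        private static final byte MAGIC_BYTE = 0x0;

        // Wrap an already-serialized payload with a schema id header.
        public static byte[] frame(int schemaId, byte[] payload) {
            return ByteBuffer.allocate(1 + Integer.BYTES + payload.length)
                    .put(MAGIC_BYTE)
                    .putInt(schemaId)
                    .put(payload)
                    .array();
        }

        // Read the id back out; the deserializer then fetches that schema
        // from the central service and uses it (plus its own reader schema)
        // to decode the remaining bytes.
        public static int schemaIdOf(byte[] message) {
            ByteBuffer buffer = ByteBuffer.wrap(message);
            byte magic = buffer.get();
            if (magic != MAGIC_BYTE) {
                throw new IllegalArgumentException("Unrecognized framing: " + magic);
            }
            return buffer.getInt();
        }
    }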

Finally, it's difficult to speculate whether the Apache Kafka project would
be interested in having a particular Converter, especially one that doesn't
exist. The project has a few general purpose ones, but note that the
project has also expressly avoided "owning" connector implementations
(other than a few example connectors). It may be that the project does feel
the Converter you describe is needed and would welcome any such
contribution. But even if that weren't the case and the project decided it
did not want to "own" any more Converters, there is probably a community of
users out there who are willing to collaborate with you. For example, as
you mention, people have suggested something similar in the Confluent Schema
Registry community, and you might find that a welcoming place to do this
work while also benefiting from the existing capabilities.

I hope this has been useful.

Randall


On Tue, Jan 23, 2018 at 1:17 PM, Andrew Otto <o...@wikimedia.org> wrote:

> Hi all,
>
> I’ve been thinking a lot recently about JSON and Kafka.  Because JSON is
> not strongly typed, it isn’t treated as a first class citizen of the Kafka
> ecosystem.  At Wikimedia, we use JSONSchema validated JSON
> <https://blog.wikimedia.org/2017/01/13/json-hadoop-kafka/> for Kafka
> messages.  This makes it so easy for our many disparate teams and services
> to consume data from Kafka, without having to consult a remote schema
> registry to read data.  (Yes we have to worry about schema evolution, but
> we do this on the producer side by requiring that the only schema change
> allowed is adding optional fields.)
>
> There’s been discussion
> <https://github.com/confluentinc/schema-registry/issues/220> about
> JSONSchema support in Confluent’s Schema registry, or perhaps even support
> to produce validated Avro JSON (not binary) from Kafka REST proxy.
>
> However, the more I think about this, I realize that I don’t really care
> about JSON support in Confluent products.  What I (and I betcha most of the
> folks who commented on the issue
> <https://github.com/confluentinc/schema-registry/issues/220>) really want
> is the ability to use Kafka Connect with JSON data.  Kafka Connect does
> sort of support this, but only if your JSON messages conform to its very
> specific envelope schema format
> <https://github.com/apache/kafka/blob/trunk/connect/json/src/main/java/org/apache/kafka/connect/json/JsonSchema.java#L61>.
>
> What if…Kafka Connect provided a JSONSchemaConverter (*not* Connect’s
> JsonConverter), that knew how to convert between a provided JSONSchema and
> Kafka Connect internal Schemas?  Would this enable what I think it would?
> Would this allow for configuration of Connectors with JSONSchemas to read
> JSON messages directly from a Kafka topic?  Once read and converted to a
> ConnectRecord, the messages could be used with any Connector out there,
> right?
>
> I might have space in the next year to work on something like this, but I
> thought I’d ask here first to see what others thought.  Would this be
> useful?  If so, is this something that might be upstreamed into Apache
> Kafka?
>
> - Andrew Otto
>   Senior Systems Engineer
>   Wikimedia Foundation
>
