Hi, Andrew. The Converter is part of Connect's public API, so it certainly is valid (and encouraged) to create new implementations when that makes sense for users; I know of several Converter implementations outside the Apache Kafka project. The project's existing JSON converter is fairly limited with respect to schemas: when configured with its default settings it serializes Connect's own schema representation rather than a standard schema language (in part because there is no truly standard schema language for JSON, and in part because that's all that's necessary). This is convenient, but it also results in very verbose messages, since the schema is included in every message. Of course, you can disable the JSON converter's use of schemas altogether, at which point the converter simply serializes and deserializes plain JSON objects, arrays, and literals.
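To make the verbosity tradeoff concrete, here is a tiny illustrative snippet showing what the existing JsonConverter emits with and without schemas enabled (the class name and topic are just for the example, and the exact envelope layout may differ slightly between versions):

    import java.nio.charset.StandardCharsets;
    import java.util.Collections;
    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.json.JsonConverter;

    public class JsonConverterEnvelopeDemo {
        public static void main(String[] args) {
            // Default behavior: the Connect schema is embedded in every serialized message.
            JsonConverter withSchemas = new JsonConverter();
            withSchemas.configure(Collections.singletonMap("schemas.enable", "true"), false);
            byte[] enveloped = withSchemas.fromConnectData("demo-topic", Schema.INT32_SCHEMA, 42);
            // Prints something like: {"schema":{"type":"int32","optional":false},"payload":42}
            System.out.println(new String(enveloped, StandardCharsets.UTF_8));

            // With schemas disabled: just the bare JSON value, no schema information at all.
            JsonConverter schemaless = new JsonConverter();
            schemaless.configure(Collections.singletonMap("schemas.enable", "false"), false);
            byte[] bare = schemaless.fromConnectData("demo-topic", Schema.INT32_SCHEMA, 42);
            // Prints: 42
            System.out.println(new String(bare, StandardCharsets.UTF_8));
        }
    }

In a worker or connector configuration the same switch is the key.converter.schemas.enable / value.converter.schemas.enable property.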
So it certainly makes sense to implement a Converter when you want or need to use other representations. The whole raison d'être of the Converter interface is that the Converter -- and not the connectors -- is solely responsible for serialization and deserialization. Simply plug in a new Converter, and you can still use any and all of the connectors that are properly implemented. Just be aware that you may also want to implement Kafka's Serializer and Deserializer interfaces, which most JVM clients require. A common approach is to implement those first and then have the Converter implementation simply delegate to them. (I've appended a minimal sketch of that pattern at the very bottom of this message, below your original note.)

When implementing a Converter that deals with schemas, consider the following questions. The answers will likely affect whether, or how easily, others can reuse that Converter in their own environments.

1) Where will the schema be persisted? Within every record? Externally, and only referenced from each message? Or used only to validate before serialization? The JSON converter takes the first approach, but that has significant disadvantages, including message size and the performance cost of repeatedly serializing and deserializing the same schema. Confluent's Schema Registry (which, as you point out, currently supports only Avro schemas) takes the second approach by storing the schema in a centralized service; the AvroConverter then includes only a small schema identifier in every message rather than the whole schema, so the deserializer knows exactly which schema was used during serialization. (This takes advantage of Avro's ability to deserialize using a different but compatible schema than the one used by the serializer.)

2) How will the converter handle schema evolution? Will it constrain how producers and consumers (whether or not they are connectors) can be upgraded?

3) How will the converter handle schema variation? Is it possible to use the converter on message streams that mix a finite number of different schemas?

Finally, it's difficult to speculate whether the Apache Kafka project would be interested in having a particular Converter, especially one that doesn't exist yet. The project has a few general-purpose ones, but note that it has also expressly avoided "owning" connector implementations (other than a few example connectors). It may be that the project does feel the Converter you describe is needed and would welcome such a contribution. But even if that weren't the case and the project decided it did not want to "own" any more Converters, there is probably a community of users out there who are willing to collaborate with you. For example, as you mention, people have suggested something similar in the Confluent Schema Registry community, and you might find that a welcoming place to do this work while also benefiting from the existing capabilities.

I hope this has been useful.

Randall

On Tue, Jan 23, 2018 at 1:17 PM, Andrew Otto <o...@wikimedia.org> wrote:
> Hi all,
>
> I’ve been thinking a lot recently about JSON and Kafka. Because JSON is
> not strongly typed, it isn’t treated as a first class citizen of the Kafka
> ecosystem. At Wikimedia, we use JSONSchema validated JSON
> <https://blog.wikimedia.org/2017/01/13/json-hadoop-kafka/> for Kafka
> messages. This makes it so easy for our many disparate teams and services
> to consume data from Kafka, without having to consult a remote schema
> registry to read data.
> (Yes we have to worry about schema evolution, but
> we do this on the producer side by requiring that the only schema change
> allowed is adding optional fields.)
>
> There’s been discussion
> <https://github.com/confluentinc/schema-registry/issues/220> about
> JSONSchema support in Confluent’s Schema registry, or perhaps even support
> to produce validated Avro JSON (not binary) from Kafka REST proxy.
>
> However, the more I think about this, I realize that I don’t really care
> about JSON support in Confluent products. What I (and I betcha most of the
> folks who commented on the issue
> <https://github.com/confluentinc/schema-registry/issues/220>) really want
> is the ability to use Kafka Connect with JSON data. Kafka Connect does
> sort of support this, but only if your JSON messages conform to its very
> specific envelope schema format
> <https://github.com/apache/kafka/blob/trunk/connect/json/src/main/java/org/apache/kafka/connect/json/JsonSchema.java#L61>.
>
> What if…Kafka Connect provided a JSONSchemaConverter (*not* Connect’s
> JsonConverter), that knew how to convert between a provided JSONSchema and
> Kafka Connect internal Schemas? Would this enable what I think it would?
> Would this allow for configuration of Connectors with JSONSchemas to read
> JSON messages directly from a Kafka topic? Once read and converted to a
> ConnectRecord, the messages could be used with any Connector out there,
> right?
>
> I might have space in the next year to work on something like this, but I
> thought I’d ask here first to see what others thought. Would this be
> useful? If so, is this something that might be upstreamed into Apache
> Kafka?
>
> - Andrew Otto
> Senior Systems Engineer
> Wikimedia Foundation
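Here is the minimal sketch I mentioned above of the "implement Kafka's Serializer/Deserializer, then reuse them from the Converter" pattern. To stay self-contained it just delegates to Kafka's own StringSerializer/StringDeserializer and is not schema-aware, so treat it as an illustration of the shape of the code rather than a usable converter; a real schema-aware implementation would do its schema mapping and validation inside fromConnectData/toConnectData. (If I recall correctly, Connect's built-in StringConverter is structured essentially this way.)

    import java.util.Map;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;
    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.data.SchemaAndValue;
    import org.apache.kafka.connect.storage.Converter;

    public class DelegatingStringConverter implements Converter {

        // The same Serializer/Deserializer that a plain Kafka client would use.
        private final StringSerializer serializer = new StringSerializer();
        private final StringDeserializer deserializer = new StringDeserializer();

        @Override
        public void configure(Map<String, ?> configs, boolean isKey) {
            serializer.configure(configs, isKey);
            deserializer.configure(configs, isKey);
        }

        @Override
        public byte[] fromConnectData(String topic, Schema schema, Object value) {
            // A schema-aware converter would consult (or validate against) the
            // Connect schema here before handing off to the shared serializer.
            return value == null ? null : serializer.serialize(topic, value.toString());
        }

        @Override
        public SchemaAndValue toConnectData(String topic, byte[] value) {
            // No schema travels with the message in this toy example, so all we
            // can report back to Connect is an optional string.
            return new SchemaAndValue(Schema.OPTIONAL_STRING_SCHEMA,
                    deserializer.deserialize(topic, value));
        }
    }

A worker would pick up a converter like this through the usual key.converter / value.converter settings, using whatever class name you give your implementation.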