[ https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303666#comment-15303666 ]
Ewen Cheslack-Postava commented on KAFKA-3744:
----------------------------------------------

Just to second [~ijuma]'s comments, this absolutely needs a KIP. "Affects the format" doesn't quite capture the requirements for a KIP. Even things that affect semantics but don't strictly affect format are subject to KIPs. The end result of the KIP could be that it doesn't affect older clients that simply ignore those bits, but it's still really important to have that discussion and make sure that's an acceptable path.

Re: the specific proposal, I'm skeptical. Magic bytes are a *very* common approach for format detection: they don't require any specialized support, are used by a lot of people today, and seem to work fine in practice. From my reading, the proposal also assumes that key and value serialization is the same, which it turns out is not the case for many users (I have seen this a lot in practice, based on issues filed against Confluent's REST proxy where people want simple serialization for keys, e.g. UTF8 strings, and complex serialization for values, e.g. GenericRecords).

Formats like JSON are the main exception here re: magic bytes. My impression is that folks that actually think about multiple formats realize up front that you need magic bytes and include them. If you use something like JSON, you tend to track this somehow externally such that you know based on topics what format you're using. I'm not convinced of the benefit here.

> Message format needs to identify serializer
> -------------------------------------------
>
> Key: KAFKA-3744
> URL: https://issues.apache.org/jira/browse/KAFKA-3744
> Project: Kafka
> Issue Type: Improvement
> Reporter: David Kay
> Priority: Minor
>
> https://issues.apache.org/jira/browse/KAFKA-3698 was recently resolved with
> https://github.com/apache/kafka/commit/27a19b964af35390d78e1b3b50bc03d23327f4d0.
> But Kafka documentation on message formats needs to be more explicit for new
> users.
> Section 1.3 Step 4 says: "Send some messages" and takes lines of text
> from the command line. Beginner's guide
> (http://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign
> Slide 104) says:
> {noformat}
> Kafka does not care about data format of msg payload
> Up to developer to handle serialization/deserialization
> Common choices: Avro, JSON
> {noformat}
> If one producer sends lines of console text, another producer sends Avro, a
> third producer sends JSON, and a fourth sends CBOR, how does the consumer
> identify which deserializer to use for the payload? The commit includes an
> opaque K byte Key that could potentially include a codec identifier, but
> provides no guidance on how to use it:
> {quote}
> "Leaving the key and value opaque is the right decision: there is a great
> deal of progress being made on serialization libraries right now, and any
> particular choice is unlikely to be right for all uses. Needless to say a
> particular application using Kafka would likely mandate a particular
> serialization type as part of its usage."
> {quote}
> Mandating any particular serialization is as unrealistic as mandating a
> single mime-type for all web content. There must be a way to signal the
> serialization used to produce this message's V byte payload, and documenting
> the existence of even a rudimentary codec registry with a few values (text,
> Avro, JSON, CBOR) would establish the pattern to be used for future
> serialization libraries.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
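
For illustration only, the magic-byte approach mentioned in the comment could look roughly like the sketch below. It is not taken from Kafka or from any existing registry: the byte values `MAGIC_TEXT` and `MAGIC_JSON`, the `DECODERS` table, and the `encode`/`decode` helpers are all hypothetical names chosen for this example. Each payload carries a one-byte prefix identifying its format, and key and value are framed independently, so a plain UTF-8 key can sit next to a structured value.

```python
import json

# Hypothetical one-byte format identifiers. These values are illustrative
# assumptions and are not defined by Kafka or by any codec registry.
MAGIC_TEXT = 0x01
MAGIC_JSON = 0x02

# Maps each magic byte to a function that decodes the remaining bytes.
DECODERS = {
    MAGIC_TEXT: lambda body: body.decode("utf-8"),
    MAGIC_JSON: lambda body: json.loads(body.decode("utf-8")),
}


def encode(magic: int, body: bytes) -> bytes:
    """Prefix an already-serialized body with its magic byte."""
    return bytes([magic]) + body


def decode(payload: bytes):
    """Choose a decoder from the first byte and apply it to the rest."""
    if not payload:
        raise ValueError("empty payload has no magic byte")
    magic, body = payload[0], payload[1:]
    try:
        decoder = DECODERS[magic]
    except KeyError:
        raise ValueError(f"unknown magic byte: {magic:#04x}") from None
    return decoder(body)


# Key and value are framed separately, so they may use different formats.
key = encode(MAGIC_TEXT, "user-42".encode("utf-8"))
value = encode(MAGIC_JSON, json.dumps({"clicks": 3}).encode("utf-8"))
```

A consumer would call `decode` on the key bytes and the value bytes separately. Payloads written without a prefix (plain JSON, for example) would still need the format tracked externally, such as per topic, as the comment notes.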