[jira] [Commented] (KAFKA-3744) Message format needs to identify serializer

Ewen Cheslack-Postava (JIRA) Fri, 27 May 2016 00:01:37 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303666#comment-15303666
 ]


Ewen Cheslack-Postava commented on KAFKA-3744:
----------------------------------------------

Just to second [~ijuma]'s comments, this absolutely needs a KIP. "Affects the 
format" doesn't quite capture the requirements for a KIP. Even things that 
affect semantics but don't strictly affect format are subject to KIPs. The end 
result of the KIP could be that it doesn't affect older clients that simply 
ignore those bits, but it's still really important to have that discussion and 
make sure that's an acceptable path.

Re: the specific proposal, I'm skeptical. Magic bytes are a *very* common 
approach for format detection and don't require any specialized support, are 
used by a lot of people today, and seem to work fine in practice. From my 
reading, the proposal also assumes that key and value serialization is the 
same, which it turns out is not the case for many users (and I have found this 
in practice a lot based on issues filed against Confluent's REST proxy where 
people want simple serialization for keys, e.g. UTF8 strings, and complex 
serialization for values, e.g. GenericRecords). Formats like JSON are the main 
exception here re: magic bytes. My impression is that folks that actually think 
about multiple formats realize up front that you need magic bytes and include 
them. If you use something like JSON, you tend to track this somehow externally 
such that you know based on topics what format you're using. I'm not convinced 
of the benefit here.
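
For readers unfamiliar with the magic-byte approach mentioned above, here is a 
minimal sketch in Python. It is only an illustration: the tag values 
(MAGIC_TEXT, MAGIC_JSON) and the encode/decode helpers are hypothetical names 
made up for this example, not part of Kafka or of any existing registry. It 
also shows the key and the value being tagged independently, so each can use a 
different serialization.

```python
import json

# Hypothetical one-byte format tags; not defined by Kafka or any real registry.
MAGIC_TEXT = 0x01
MAGIC_JSON = 0x02

# Map each tag to a function that decodes the bytes following the tag.
DECODERS = {
    MAGIC_TEXT: lambda body: body.decode("utf-8"),
    MAGIC_JSON: lambda body: json.loads(body.decode("utf-8")),
}


def encode(magic, body):
    """Prefix an already-serialized body with its one-byte format tag."""
    return bytes([magic]) + body


def decode(payload):
    """Pick a decoder from the first byte and apply it to the remaining bytes."""
    if not payload:
        raise ValueError("empty payload has no magic byte")
    magic, body = payload[0], payload[1:]
    try:
        decoder = DECODERS[magic]
    except KeyError:
        raise ValueError(f"unknown magic byte: {magic:#04x}") from None
    return decoder(body)


# Key and value are tagged separately, so they may use different formats:
# a plain UTF-8 string key and a JSON value.
key = encode(MAGIC_TEXT, "user-42".encode("utf-8"))
value = encode(MAGIC_JSON, json.dumps({"clicks": 3}).encode("utf-8"))
```

A consumer would call decode() on the key bytes and the value bytes 
separately. Since the tag travels inside the payload, this needs no change to 
the message format itself, only agreement among producers and consumers on the 
tag values.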
 

> Message format needs to identify serializer
> -------------------------------------------
>
>                 Key: KAFKA-3744
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3744
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: David Kay
>            Priority: Minor
>
> https://issues.apache.org/jira/browse/KAFKA-3698 was recently resolved with 
> https://github.com/apache/kafka/commit/27a19b964af35390d78e1b3b50bc03d23327f4d0.
> But Kafka documentation on message formats needs to be more explicit for new 
> users. Section 1.3 Step 4 says: "Send some messages" and takes lines of text 
> from the command line. The beginner's guide at 
> http://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign 
> (slide 104) says:
> {noformat}
>    Kafka does not care about data format of msg payload
>    Up to developer to handle serialization/deserialization
>       Common choices: Avro, JSON
> {noformat}
> If one producer sends lines of console text, another producer sends Avro, a 
> third producer sends JSON, and a fourth sends CBOR, how does the consumer 
> identify which deserializer to use for the payload?  The commit includes an 
> opaque K byte Key that could potentially include a codec identifier, but 
> provides no guidance on how to use it:
> {quote}
> "Leaving the key and value opaque is the right decision: there is a great 
> deal of progress being made on serialization libraries right now, and any 
> particular choice is unlikely to be right for all uses. Needless to say a 
> particular application using Kafka would likely mandate a particular 
> serialization type as part of its usage."
> {quote}
> Mandating any particular serialization is as unrealistic as mandating a 
> single mime-type for all web content.  There must be a way to signal the 
> serialization used to produce this message's V byte payload, and documenting 
> the existence of even a rudimentary codec registry with a few values (text, 
> Avro, JSON, CBOR) would establish the pattern to be used for future 
> serialization libraries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)