[jira] [Commented] (KAFKA-3744) Message format needs to identify serializer

ASF GitHub Bot (JIRA) Sat, 24 Feb 2018 23:23:08 -0800

    [ 
https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375972#comment-16375972
 ]


ASF GitHub Bot commented on KAFKA-3744:
---------------------------------------

hachikuji closed pull request #1419: KAFKA-3744: Allocate 2 attribute bits to 
signal payload format
URL: https://github.com/apache/kafka/pull/1419
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/docs/implementation.html b/docs/implementation.html
index 16ba07a456c..aa58d8c740a 100644
--- a/docs/implementation.html
+++ b/docs/implementation.html
@@ -146,12 +146,24 @@ <h3><a id="messages" href="#messages">5.3 
Messages</a></h3>
 <p>
 Messages consist of a fixed-size header, a variable length opaque key byte 
array and a variable length opaque value byte array. The header contains the 
following fields:
 <ul>
-    <li> A CRC32 checksum to detect corruption or truncation. <li/>
+    <li> A CRC32 checksum to detect corruption or truncation. </li>
     <li> A format version. </li>
     <li> An attributes identifier </li>
     <li> A timestamp </li>
 </ul>
-Leaving the key and value opaque is the right decision: there is a great deal 
of progress being made on serialization libraries right now, and any particular 
choice is unlikely to be right for all uses. Needless to say a particular 
application using Kafka would likely mandate a particular serialization type as 
part of its usage. The <code>MessageSet</code> interface is simply an iterator 
over messages with specialized methods for bulk reading and writing to an NIO 
<code>Channel</code>.
+Leaving the key and payload mostly opaque is the right decision: there is a 
great deal of progress being made on serialization libraries right now, and any 
particular choice is unlikely to be right for all uses. But to facilitate 
interoperability two attribute bits are defined as a serialization selector:
+<ul>
+  <li>0 and 1 specify two payload encodings (text and avro-binary); key format 
is unspecified.</li>
+  <li>2 specifies that the key must be a JSON object with a property "t" 
containing a
+<a 
href="http://www.iana.org/assignments/media-types/media-types.xhtml">media-type</a>
 string
+registered with IANA.  For example, key <pre>  {"t":"application/cbor"}</pre> 
specifies that the
+payload is serialized using Concise Binary Object Representation, RFC 7049. 
The JSON object in key
+may contain an arbitrary set of additional properties.  Using media-type 
allows payloads of any
+registered format (e.g., image/jpeg, application/pdf) to be specified.</li>
+  <li>3 is reserved; key and payload formats are unspecified.</li>
+</ul>
+
+The <code>MessageSet</code> interface is simply an iterator over messages with 
specialized methods for bulk reading and writing to an NIO <code>Channel</code>.
 
 <h3><a id="messageformat" href="#messageformat">5.4 Message Format</a></h3>
 
@@ -165,10 +177,16 @@ <h3><a id="messageformat" href="#messageformat">5.4 
Message Format</a></h3>
      *      1 : gzip
      *      2 : snappy
      *      3 : lz4
+     *      4~7 : reserved
      *    bit 3 : Timestamp type
      *      0 : create time
      *      1 : log append time
-     *    bit 4 ~ 7 : reserved
+     *    bit 4 ~ 5 : Serialization
+     *      0 : key: opaque, payload: text/plain
+     *      1 : key: opaque, payload: avro-binary
+     *      2 : key: json object, payload: media-type specified by property "t"
+     *      3 : reserved (key: opaque, payload: opaque)
+     *    bit 6 ~ 7 : reserved
      * 4. (Optional) 8 byte timestamp only if "magic" identifier is greater 
than 0
      * 5. 4 byte key length, containing length K
      * 6. K byte key
@@ -195,8 +213,8 @@ <h3><a id="log" href="#log">5.5 Log</a></h3>
 timestamp      : 8 bytes (Only exists when magic value is greater than zero)
 key length     : 4 bytes
 key            : K bytes
-value length   : 4 bytes
-value          : V bytes
+payload length : 4 bytes
+payload        : V bytes
 </pre>
 <p>
 The use of the message offset as the message id is unusual. Our original idea 
was to use a GUID generated by the producer, and maintain a mapping from GUID 
to offset on each broker. But since a consumer must maintain an ID for each 
server, the global uniqueness of the GUID provides no value. Furthermore the 
complexity of maintaining the mapping from a random id to an offset requires a 
heavy weight index structure which must be synchronized with disk, essentially 
requiring a full persistent random-access data structure. Thus to simplify the 
lookup structure we decided to use a simple per-partition atomic counter which 
could be coupled with the partition id and node id to uniquely identify a 
message; this makes the lookup structure simpler, though multiple seeks per 
consumer request are still likely. However once we settled on a counter, the 
jump to directly using the offset seemed natural&mdash;both after all are 
monotonically increasing integers unique to a partition. Since the offset is 
hidden from the consumer API this decision is ultimately an implementation 
detail and we went with the more efficient approach.
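
For readers following the 5.4 Message Format hunk above, the attributes byte it proposes can be unpacked with a few masks. This is only a sketch of the layout described in the diff, not code from Kafka: the function name, constant names, and returned dictionary are hypothetical, and compression value 0 ("no compression") comes from the existing Kafka documentation rather than from the lines shown in the hunk.

```python
# Sketch of the attributes-byte layout proposed in the diff above.
# All names here are illustrative assumptions, not Kafka's API.

COMPRESSION_MASK = 0x07      # bits 0-2: compression codec
TIMESTAMP_TYPE_MASK = 0x08   # bit 3: timestamp type
SERIALIZATION_MASK = 0x30    # bits 4-5: proposed serialization selector
SERIALIZATION_SHIFT = 4

# Values 4-7 are marked reserved in the diff.
COMPRESSION = {0: "none", 1: "gzip", 2: "snappy", 3: "lz4"}


def decode_attributes(attributes: int) -> dict:
    """Split a one-byte attributes field into its proposed sub-fields."""
    if not 0 <= attributes <= 0xFF:
        raise ValueError("attributes must fit in one byte")
    codec = attributes & COMPRESSION_MASK
    return {
        "compression": COMPRESSION.get(codec, "reserved"),
        "timestamp_type": (
            "log append time"
            if attributes & TIMESTAMP_TYPE_MASK
            else "create time"
        ),
        # 0..3, as listed under "bit 4 ~ 5 : Serialization" in the diff
        "serialization": (attributes & SERIALIZATION_MASK) >> SERIALIZATION_SHIFT,
    }


if __name__ == "__main__":
    # gzip, create time, serialization selector 2 (JSON key with "t")
    print(decode_attributes(0b00100001))
```

Bits 6 and 7 stay reserved in the proposal, so the sketch ignores them rather than rejecting them.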


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Message format needs to identify serializer
> -------------------------------------------
>
>                 Key: KAFKA-3744
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3744
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: David Kay
>            Priority: Minor
>
> https://issues.apache.org/jira/browse/KAFKA-3698 was recently resolved with 
> https://github.com/apache/kafka/commit/27a19b964af35390d78e1b3b50bc03d23327f4d0.
> But Kafka documentation on message formats needs to be more explicit for new 
> users. Section 1.3 Step 4 says: "Send some messages" and takes lines of text 
> from the command line. Beginner's guide 
> (http://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign), 
> Slide 104, says:
> {noformat}
>    Kafka does not care about data format of msg payload
>    Up to developer to handle serialization/deserialization
>       Common choices: Avro, JSON
> {noformat}
> If one producer sends lines of console text, another producer sends Avro, a 
> third producer sends JSON, and a fourth sends CBOR, how does the consumer 
> identify which deserializer to use for the payload?  The commit includes an 
> opaque K byte Key that could potentially include a codec identifier, but 
> provides no guidance on how to use it:
> {quote}
> "Leaving the key and value opaque is the right decision: there is a great 
> deal of progress being made on serialization libraries right now, and any 
> particular choice is unlikely to be right for all uses. Needless to say a 
> particular application using Kafka would likely mandate a particular 
> serialization type as part of its usage."
> {quote}
> Mandating any particular serialization is as unrealistic as mandating a 
> single mime-type for all web content.  There must be a way to signal the 
> serialization used to produce this message's V byte payload, and documenting 
> the existence of even a rudimentary codec registry with a few values (text, 
> Avro, JSON, CBOR) would establish the pattern to be used for future 
> serialization libraries.
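
One way a consumer could act on the selector proposed in the pull request is sketched below. It assumes the four selector values from the diff (0: text/plain, 1: avro-binary, 2: JSON key with a "t" property, 3: reserved) and assumes the JSON key is UTF-8 encoded, which the diff does not state. The function name is hypothetical and nothing here is part of Kafka's client API.

```python
import json
from typing import Optional


def payload_format(selector: int, key: Optional[bytes]) -> Optional[str]:
    """Return a label for the payload format, or None when unspecified.

    `selector` is the 2-bit serialization value proposed in the diff.
    """
    if selector == 0:
        return "text/plain"
    if selector == 1:
        return "avro-binary"  # label as written in the diff
    if selector == 2:
        if key is None:
            raise ValueError("selector 2 requires a JSON object key")
        # Assumption: the key is UTF-8 encoded JSON.
        obj = json.loads(key.decode("utf-8"))
        if not isinstance(obj, dict) or not isinstance(obj.get("t"), str):
            raise ValueError('key must be a JSON object with a string "t"')
        return obj["t"]  # additional properties in the key are ignored
    if selector == 3:
        return None  # reserved: key and payload formats unspecified
    raise ValueError("selector must be in the range 0..3")


if __name__ == "__main__":
    print(payload_format(2, b'{"t":"application/cbor"}'))
```

A consumer would then map the returned string to whichever deserializer it has registered for that format; that registry is left out because the issue does not define one.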



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
