I am attempting to understand the details of the content of the log segment file in Kafka.
The documentation (http://kafka.apache.org/081/documentation.html#log) says:

    The exact binary format for messages is versioned and maintained as a
    standard interface so message sets can be transferred between producer,
    broker, and client without recopying or conversion when desirable.
    This format is as follows:

    On-disk format of a message

    message length : 4 bytes (value: 1+4+n)
    "magic" value  : 1 byte
    crc            : 4 bytes
    payload        : n bytes

But I am struggling to map the documentation to what I see on disk.

I created a topic named sample-topic and added one message to it (via the console producer). The message payload was “message1”.

The DumpLogSegments tool shows:

    Dumping /tmp/kafka-logs/sample-topic-0/00000000000000000000.log
    Starting offset: 0
    offset: 0 position: 0 isvalid: true payloadsize: 8 magic: 0 compresscodec: NoCompressionCodec crc: 3916773564

Taking a hex dump of the (only) log file:

    sample-topic-0 sgg$ hexdump -C 00000000000000000000.log | more
    00000000  00 00 00 00 00 00 00 00  00 00 00 16 e9 75 38 bc  |.............u8.|
    00000010  00 00 ff ff ff ff 00 00  00 08 6d 65 73 73 61 67  |..........messag|
    00000020  65 31                                             |e1|
    00000022

I tried to “reverse engineer” the contents to see how they correspond to the documentation:

- Bytes 0-7 (00 00 00 00 00 00 00 00): I am not sure what this is. Some sort of filler?
- Bytes 8-11 (00 00 00 16): this seems to be a length field. It is decimal 22, which matches the number of bytes that follow it in the file, but that is more than the 1+4+n suggested by the documentation.
- Bytes 12-15 (e9 75 38 bc): this corresponds to the CRC (decimal 3916773564). No problem here.
- Bytes 16-17 (00 00): not sure what this is.
- Bytes 18-21 (ff ff ff ff): not sure what this is. A “magic number”? But that should be just one byte, so it must be something else.
- Bytes 22-25 (00 00 00 08): this is the message payload size (8), the value of “n” in the formula for message length and exactly the length of the “message1” string. No problem here.
- Bytes 26-33 (6d 65 73 73 61 67 65 31): this is the payload (ASCII: message1). No problem here.

Can anyone on the list help me reconcile the documentation with what I see on disk? Specifically:

a) What are the first 8 bytes supposed to represent?

b) The message length field, described as 1+4+n, doesn’t correspond to what I see on disk. It looks like 4 (crc) + 2 (??) + 4 (?magic number?) + 4 (payload length) + 8 (n) = 22. What is the correct formula?

c) Why does the CRC appear so early in the message (bytes 12-15)? Shouldn’t the magic value appear before the CRC?

d) How should bytes 16-21 be interpreted? Is the magic number in here somewhere? What else is in this set of bytes?

Thanks
sgg
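
P.S. For anyone who wants to poke at the same bytes, here is a minimal Python sketch that decodes the dump above. It assumes the layout that the 0.8 wire-protocol description appears to use, which is not the layout in the documentation section quoted above: an 8-byte offset and a 4-byte message size, followed by a 4-byte CRC, a 1-byte magic value, a 1-byte attributes field, a 4-byte key length (-1 meaning no key), the key bytes, a 4-byte value length, and the value bytes. The names LOG_BYTES and parse_entry are made up for this illustration and are not part of Kafka.

```python
import struct

# The 34 bytes from the hex dump above, grouped by the ASSUMED field layout.
LOG_BYTES = bytes.fromhex(
    "0000000000000000"   # offset (8 bytes)
    "00000016"           # message size (4 bytes)
    "e97538bc"           # crc (4 bytes)
    "00"                 # magic (1 byte)
    "00"                 # attributes (1 byte)
    "ffffffff"           # key length (4 bytes, -1 = no key)
    "00000008"           # value length (4 bytes)
    "6d65737361676531"   # value ("message1")
)


def parse_entry(buf):
    """Decode one log entry under the assumed 0.8 layout (hypothetical helper)."""
    # Big-endian: int64 offset, uint32 size, uint32 crc,
    # int8 magic, int8 attributes, int32 key length.
    header = struct.Struct(">qIIbbi")
    offset, size, crc, magic, attributes, key_len = header.unpack_from(buf, 0)
    pos = header.size

    # A key length of -1 means the message has no key.
    key = None
    if key_len >= 0:
        key = buf[pos:pos + key_len]
        pos += key_len

    (value_len,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    value = buf[pos:pos + value_len]

    return {
        "offset": offset,
        "size": size,
        "crc": crc,
        "magic": magic,
        "attributes": attributes,
        "key": key,
        "value": value,
    }


entry = parse_entry(LOG_BYTES)
```

If that assumed layout is right, the size field of 22 would break down as 4 (crc) + 1 (magic) + 1 (attributes) + 4 (key length) + 4 (value length) + 8 (value), and bytes 0-7 would be the message offset rather than filler.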