I am attempting to understand the details of the content of the log segment 
file in Kafka.

The documentation (http://kafka.apache.org/081/documentation.html#log)  
suggests:
The exact binary format for messages is versioned and maintained as a standard 
interface so message sets can be transferred between producer, broker, and 
client without recopying or conversion when desirable. This format is as 
follows:

On-disk format of a message

message length : 4 bytes (value: 1+4+n) 
"magic" value  : 1 byte
crc            : 4 bytes
payload        : n bytes
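
To make my reading of that layout concrete, here is a minimal Python sketch of the sizes I would expect for an 8-byte payload if the documented format were the whole story. The big-endian, unpadded packing is my assumption, not something the documentation excerpt states.

```python
import struct

payload = b"message1"
n = len(payload)

# Documented layout: 4-byte length, 1-byte magic, 4-byte crc, n-byte payload.
# ">" means big-endian with no alignment padding (an assumption on my part).
documented_header = struct.Struct(">IBI")  # length, magic, crc

# Value the documentation says is stored in the length field.
documented_length_field = 1 + 4 + n

# Total bytes the record would occupy on disk under that layout.
documented_total_bytes = documented_header.size + n

print(documented_length_field)  # 13
print(documented_total_bytes)   # 17
```

So under the documented layout I would expect a length field of 13 and a 17-byte record for this message.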


But I am struggling to map the documentation to what I see on the disk.

I created a topic, named sample-topic, and added one message to it (via the 
console producer).  The message payload was “message1”.

The DumpLogSegments tool shows:
Dumping /tmp/kafka-logs/sample-topic-0/00000000000000000000.log
Starting offset: 0
offset: 0 position: 0 isvalid: true payloadsize: 8 magic: 0 compresscodec: 
NoCompressionCodec crc: 3916773564

Taking a hex dump of the (only) log file:
sample-topic-0 sgg$ hexdump -C 00000000000000000000.log | more
00000000  00 00 00 00 00 00 00 00  00 00 00 16 e9 75 38 bc  |.............u8.|
00000010  00 00 ff ff ff ff 00 00  00 08 6d 65 73 73 61 67  |..........messag|
00000020  65 31                                             |e1|
00000022

I tried to “reverse engineer” the contents, to see how it corresponds to the 
documentation:

Bytes 0-7 (00 00 00 00 00 00 00 00).  I am not sure what this is, some sort of 
filler?
Bytes 8-11 (00 00 00 16) seem to be some length field?  Decimal 22, which 
seems to correspond to the length of the entire message, but that is more 
than the 1+4+n suggested by the documentation.
Bytes 12-15 (e9 75 38 bc) this corresponds to the CRC (decimal 3916773564).  No 
problem here.
Bytes 16-17 (00 00) not sure what this is.
Bytes 18-21 (ff ff ff ff) not sure what this is. A “magic number”?  But that 
should be just one byte.  Must be something else?
Bytes 22-25 (00 00 00 08) is the message payload size (8).  This is the value 
of “n” in the formula for message length, exactly the length of the “message1” 
string. No problem here.
Bytes 26-33 (6d 65 73 73 61 67 65 31) is the payload (ascii: message1).  No 
problem here.
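
As a cross-check of the byte ranges listed above, here is a minimal Python sketch that slices the hex dump at exactly those offsets. The field names are my own labels rather than names from the Kafka documentation, and big-endian byte order is an assumption based on how the length and CRC values read in the dump.

```python
import struct

# The 34 bytes from the hexdump output above.
dump = bytes.fromhex(
    "00 00 00 00 00 00 00 00 00 00 00 16 e9 75 38 bc "
    "00 00 ff ff ff ff 00 00 00 08 6d 65 73 73 61 67 "
    "65 31"
)

# Slice at the offsets listed above (labels are hypothetical).
unknown_0_7 = dump[0:8]                               # purpose unclear to me
length_field = struct.unpack(">I", dump[8:12])[0]     # apparent length field
crc = struct.unpack(">I", dump[12:16])[0]             # matches DumpLogSegments
unknown_16_17 = dump[16:18]                           # purpose unclear to me
unknown_18_21 = dump[18:22]                           # purpose unclear to me
payload_len = struct.unpack(">I", dump[22:26])[0]     # "n"
payload = dump[26:26 + payload_len]

# Bytes remaining after the length field ends at offset 12.
bytes_after_length_field = len(dump) - 12

print(length_field)              # 22
print(crc)                       # 3916773564
print(unknown_18_21.hex())       # ffffffff
print(payload_len, payload)      # 8 b'message1'
print(bytes_after_length_field)  # 22
```

The length field (22) equals the number of bytes that follow it, which is the 4 + 2 + 4 + 4 + 8 breakdown rather than the documented 1+4+n (13).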

Can anyone on the list help me reconcile the documentation to what I see on the 
disk?  Specifically:
a) what are the first 8 bytes supposed to represent?  
b) the message length field, described as 1+4+n, doesn’t correspond with what 
I see on disk.  It looks like 4 (crc) + 2 (??) + 4 (?magic number?) + 4 
(payload length) + 8 (n).  What is the correct formula?
c) why does the CRC appear so early in the message (bytes 12-15)?  Shouldn’t 
the magic value appear before the CRC?
d) what is the way to interpret bytes 16-21?  is the magic number in here 
somewhere?  What else is in this set of bytes?

Thanks
sgg
