After a recent upgrade to 0.8.2.1, we noticed a significant increase in the filesystem space used by our Kafka log data. We have another Kafka cluster still on 0.8.1.1 whose data is being copied over to the upgraded cluster, and it is clear that disk consumption is higher on 0.8.2.1 for the same message data. The log retention config is also the same for the two clusters.
We ran some tests to figure out what was happening, and it appears that in 0.8.2.1 the Kafka brokers re-compress each message individually (we’re using Snappy), while in 0.8.1.1 they applied the compression across an entire batch of messages written to the log. For producers sending large batches of small similar messages, the difference can be quite substantial (in our case, it looks like a little over 2x). Is this a bug, or the expected new behavior?

thanks,
Andrew
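
To illustrate why compressing each message individually can take more space than compressing a whole batch, here is a minimal sketch. It uses zlib from the Python standard library as a stand-in for Snappy, and the sample messages and function names are hypothetical. It does not reproduce Kafka's actual log format or the measurements described above.

```python
import json
import zlib


def make_messages(count):
    """Build small, similar JSON messages (hypothetical sample data)."""
    return [
        json.dumps({"host": "web-01", "event": "page_view", "seq": i}).encode("utf-8")
        for i in range(count)
    ]


def per_message_size(messages):
    """Total bytes when each message is compressed on its own."""
    return sum(len(zlib.compress(m)) for m in messages)


def batch_size(messages):
    """Total bytes when the whole batch is compressed as one unit."""
    # Length-prefix each message so the batch could be split again
    # after decompression.
    framed = b"".join(len(m).to_bytes(4, "big") + m for m in messages)
    return len(zlib.compress(framed))


messages = make_messages(1000)
individual = per_message_size(messages)
batched = batch_size(messages)
print(f"per-message: {individual} bytes")
print(f"batched:     {batched} bytes")
print(f"ratio:       {individual / batched:.1f}x")
```

With small, similar messages the compressor can only exploit redundancy across messages when it sees them together, so the batched total should come out noticeably smaller. The exact ratio depends on the codec and the data.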