I just finished upgrading our main production cluster from 0.9.0.1 to 0.10.1.0 with an online rolling upgrade, and I noticed something strange: the leader for one of our big partitions immediately expired all of the logs from before the upgrade. I have log.retention.hours set to 336 in my config, and the replicas still have data going back to October 17, but after restarting on 0.10.1.0, the topic leader deleted every segment more than a couple of hours old (approximately 2 TB of data on that box).
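
For reference, here is that retention setting as it appears in my broker config (just restating the value above in property form; everything else omitted):

log.retention.hours=336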
inter.broker.protocol.version and log.message.format.version are both still set to 0.9.0.1 in my config. Before the upgrade, the oldest available offset in this topic/partition was 812555925; now it's 848947551.

I assume this is some bug with upgrading to 0.10.1.0 when the extant data doesn't have any associated timestamps, but it seems, uh, really unexpected, and if I'd had any consumers which were behind, I could've ended up losing quite a lot of data here. It's particularly bizarre that this didn't affect anything except the leader (yet).

It may be that this is expected behavior, but I guess I just assumed that the code would fall back to using the mtime if timestamps were not present in the log, rather than assuming that the timestamp of a given segment was 0 (see the sketch at the end of this message). If this is expected behavior, I would recommend adding a specific note to the "Potential breaking changes in 0.10.1.0" section of the manual indicating that upgrading from 0.9.0.1 might immediately truncate all of your data.

Debugging output is below:

% kafka-topics.sh --zookeeper localhost:40169 --describe --topic easypost.request_log
Topic:easypost.request_log  PartitionCount:4  ReplicationFactor:3  Configs:
    Topic: easypost.request_log  Partition: 0  Leader: 1  Replicas: 1,4,2  Isr: 4,2,1
    Topic: easypost.request_log  Partition: 1  Leader: 4  Replicas: 4,1,2  Isr: 4,2,1
    Topic: easypost.request_log  Partition: 2  Leader: 5  Replicas: 5,2,3  Isr: 5,3,2
    Topic: easypost.request_log  Partition: 3  Leader: 3  Replicas: 3,5,2  Isr: 5,3,2

(on broker #1):
% ls -l /srv/var/kafka/logs/easypost.request_log-0/ | wc -l
25

(on broker #4):
% ls -l /srv/var/kafka/logs/easypost.request_log-0/ | wc -l
3391

When the actual deletion occurred, there were no errors in the log; just a lot of messages like:

INFO Scheduling log segment 811849601 for log easypost.request_log-0 for deletion. (kafka.log.Log)

I suspect it's too late to undo anything related to this, and I don't actually think any of our consumers were relying on this data, but I figured I'd send along this report and see if anybody else has seen behavior like this.

Thanks,
--
James Brown
Engineer
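
P.S. To be concrete about the fallback I'd expected, here's a sketch. This is purely illustrative on my part (plain Java; not Kafka's actual code, and the class and method names are made up), showing retention keyed off the segment file's mtime whenever the records themselves carry no timestamp, instead of treating such a segment as infinitely old:

import java.io.File;

/*
 * Illustrative sketch only -- not Kafka's implementation. It shows the
 * retention decision I had assumed: prefer record timestamps when they
 * exist, and fall back to the segment file's mtime for pre-0.10 data
 * that has no timestamps, rather than treating its timestamp as 0.
 */
public class RetentionSketch {

    // Pre-0.10 segments carry no record timestamps.
    static final long NO_TIMESTAMP = -1L;

    // log.retention.hours=336, expressed in milliseconds.
    static final long RETENTION_MS = 336L * 60 * 60 * 1000;

    /** Decide whether a segment has aged out of retention. */
    static boolean shouldDelete(File segmentFile, long largestRecordTimestamp, long nowMs) {
        // Prefer the largest record timestamp in the segment when present...
        long effectiveTimestamp = largestRecordTimestamp > 0
                ? largestRecordTimestamp
                // ...otherwise fall back to the file's mtime, so old-format
                // segments don't all look older than any possible retention.
                : segmentFile.lastModified();
        return nowMs - effectiveTimestamp > RETENTION_MS;
    }

    public static void main(String[] args) {
        // Hypothetical segment path, just to show the call; on a box where
        // the file doesn't exist, lastModified() is 0 and this prints true.
        File segment = new File("/srv/var/kafka/logs/easypost.request_log-0/00000000000811849601.log");
        System.out.println(shouldDelete(segment, NO_TIMESTAMP, System.currentTimeMillis()));
    }
}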