Incidentally, I'd like to note that this did *not* occur in my testing environment (which didn't expire any unexpected segments after upgrading), so if it is a feature, it's certainly a hit-or-miss one.
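In case anyone else wants to check for the same thing, comparing the earliest available offset before and after the restart should show whether anything got dropped, and dumping one of the old segment files should show whether the messages actually carry timestamps. These are just sketches — the broker address and the segment file name below are stand-ins, not literal values from our cluster:

% kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 \
    --topic easypost.request_log --partitions 0 --time -2
# --time -2 asks for the earliest available offset on the partition leader

% kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration \
    --files /srv/var/kafka/logs/easypost.request_log-0/00000000000848947551.log | head
# messages still in the old 0.9 on-disk format have no timestamp attached,
# which, as far as I can tell, is what the new time-based retention trips over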
On Mon, Oct 31, 2016 at 4:14 PM, James Brown <jbr...@easypost.com> wrote:

> I just finished upgrading our main production cluster to 0.10.1.0 (from
> 0.9.0.1) with an on-line rolling upgrade, and I noticed something strange:
> the leader for one of our big partitions just decided to expire all of the
> logs from before the upgrade. I have log.retention.hours set to 336 in my
> config, and the replicas still have data going back to October 17, but
> after restarting for 0.10.1.0, the topic leader deleted all segments more
> than a couple of hours old (approximately 2TB of data on that box).
>
> inter.broker.protocol.version and log.message.format.version are both
> still set to 0.9.0.1 in my config.
>
> Before the upgrade, the oldest available offset in this topic/partition
> was 812555925; now it's 848947551.
>
> I assume this is some bug with upgrading to 0.10.1.0 when the extant data
> doesn't have any associated timestamps, but it seems, uh, really
> unexpected, and if I'd had any consumers which were behind, I could've
> ended up losing quite a lot of data here. It's particularly bizarre that
> this didn't affect anything except the leader (yet).
>
> It may be that this is expected behavior, but I guess I just assumed that
> the code would fall back to using the mtime if timestamps were not present
> in the log rather than assuming that the timestamp of a given segment was
> 0. If this is expected behavior, I would recommend adding a specific note
> to the "Potential breaking changes in 0.10.1.0" section of the manual
> indicating that upgrading from 0.9.0.1 might immediately truncate all of
> your data.
>
> Debugging output is below:
>
> % kafka-topics.sh --zookeeper localhost:40169 --describe --topic easypost.request_log
> Topic:easypost.request_log  PartitionCount:4  ReplicationFactor:3  Configs:
>     Topic: easypost.request_log  Partition: 0  Leader: 1  Replicas: 1,4,2  Isr: 4,2,1
>     Topic: easypost.request_log  Partition: 1  Leader: 4  Replicas: 4,1,2  Isr: 4,2,1
>     Topic: easypost.request_log  Partition: 2  Leader: 5  Replicas: 5,2,3  Isr: 5,3,2
>     Topic: easypost.request_log  Partition: 3  Leader: 3  Replicas: 3,5,2  Isr: 5,3,2
>
> (on broker #1):
>
> % ls -l /srv/var/kafka/logs/easypost.request_log-0/ | wc -l
> 25
>
> (on broker #4):
>
> % ls -l /srv/var/kafka/logs/easypost.request_log-0/ | wc -l
> 3391
>
> When the actual deletion occurred, there were no errors in the log; just a
> lot of messages like:
>
> INFO Scheduling log segment 811849601 for log easypost.request_log-0 for
> deletion. (kafka.log.Log)
>
> I suspect it's too late to undo anything related to this, and I don't
> actually think any of our consumers were relying on this data, but I
> figured I'd send along this report and see if anybody else has seen
> behavior like this.
>
> Thanks,
> --
> James Brown
> Engineer

--
James Brown
Engineer