Incidentally, I'd like to note that this did *not* occur in my testing
environment (which didn't unexpectedly expire any segments after the upgrade),
so if it is a feature, it's certainly a hit-or-miss one.

On Mon, Oct 31, 2016 at 4:14 PM, James Brown <jbr...@easypost.com> wrote:

> I just finished upgrading our main production cluster to 0.10.1.0 (from
> 0.9.0.1) with an on-line rolling upgrade, and I noticed something strange —
> the leader for one of our big partitions just decided to expire all of the
> logs from before the upgrade. I have log.retention.hours set to 336 in my
> config, and the replicas still have data going back to October 17, but
> after restarting for 0.10.1.0, the topic leader deleted all segments more
> than a couple of hours old (approximately 2TB of data on that box).
>
> inter.broker.protocol.version and log.message.format.version are both
> still set to 0.9.0.1 in my config
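>
> For reference, those settings (and the retention window) straight out of
> the broker config, where /etc/kafka/server.properties is just the path on
> my boxes:
>
> % grep -E 'inter.broker.protocol.version|log.message.format.version|log.retention.hours' \
>     /etc/kafka/server.properties
> inter.broker.protocol.version=0.9.0.1
> log.message.format.version=0.9.0.1
> log.retention.hours=336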
>
> Before the upgrade, the oldest available offset in this topic/partition
> was 812555925; now it's 848947551.
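>
> If anyone wants to check the same thing, the earliest available offset can
> be read with GetOffsetShell (--time -2 means "earliest"); the broker
> address below is just a placeholder, and the output is the post-deletion
> value:
>
> % kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 \
>     --topic easypost.request_log --partitions 0 --time -2
> easypost.request_log:0:848947551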
>
> I assume this is some bug with upgrading to 0.10.1.0 when the extant data
> doesn't have any associated timestamps, but it seems, uh, really
> unexpected, and if I'd had any consumers which were behind, I could've
> ended up losing quite a lot of data here. It's particularly bizarre that
> this didn't affect anything except the leader (yet).
>
> It may be that this is expected behavior, but I guess I just assumed that
> the code would fall back to using the mtime if timestamps were not present
> in the log rather than assuming that the timestamp of a given segment was
> 0. If this is expected behavior, I would recommend adding a specific note
> to the "Potential breaking changes in 0.10.1.0" section of the manual
> indicating that upgrading from 0.9.0.1 might immediately truncate all of
> your data.
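>
> For comparison, this is roughly what mtime-based retention would have
> considered expired (336 hours = 14 days; same data directory as below):
>
> % find /srv/var/kafka/logs/easypost.request_log-0/ -name '*.log' -mtime +14
>
> On the replicas, which only have segments back to October 17 anyway, that
> turns up little or nothing, nowhere near everything the leader deleted.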
>
>
> Debugging output is below:
>
>
> % kafka-topics.sh --zookeeper localhost:40169 --describe  --topic
> easypost.request_log
> Topic:easypost.request_log PartitionCount:4 ReplicationFactor:3 Configs:
> Topic: easypost.request_log Partition: 0 Leader: 1 Replicas: 1,4,2 Isr: 4,2,1
> Topic: easypost.request_log Partition: 1 Leader: 4 Replicas: 4,1,2 Isr: 4,2,1
> Topic: easypost.request_log Partition: 2 Leader: 5 Replicas: 5,2,3 Isr: 5,3,2
> Topic: easypost.request_log Partition: 3 Leader: 3 Replicas: 3,5,2 Isr: 5,3,2
>
> (on broker #1):
>
> % ls -l /srv/var/kafka/logs/easypost.request_log-0/ | wc -l
> 25
>
> (on broker #4):
>
> % ls -l /srv/var/kafka/logs/easypost.request_log-0/ | wc -l
> 3391
>
>
> When the actual deletion occurred, there were no errors in the log; just a
> lot of messages like:
>
> INFO Scheduling log segment 811849601 for log easypost.request_log-0 for
> deletion. (kafka.log.Log)
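>
> To gauge how much got scheduled at once, counting those lines in the
> broker's server.log works; the log path here is just an example from my
> layout:
>
> % grep -c 'Scheduling log segment.*easypost.request_log-0' /var/log/kafka/server.log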
>
>
> I suspect it's too late to undo anything related to this, and I don't
> actually think any of our consumers were relying on this data, but I
> figured I'd send along this report and see if anybody else has seen
> behavior like this.
>
> Thanks,
> --
> James Brown
> Engineer
>



-- 
James Brown
Engineer
