Hi everyone,

We run a 3-broker Kafka cluster on 0.8.1.1, with every topic having a replication factor of 3, meaning every broker holds a replica of every partition.
We recently ran into this issue (https://issues.apache.org/jira/browse/KAFKA-1028) and saw data loss within Kafka. We understand why it happened and have plans to try to ensure it doesn't happen again. The strange part was that the broker chosen in the unclean leader election appeared to drop all of its own data for the partition in the process: our monitoring shows that broker's offset was reset to 0 for a number of partitions.

Following that broker's server logs in chronological order for one partition that saw data loss, I see:

  2014-10-16 10:18:11,104 INFO kafka.log.Log: Completed load of log TOPIC-6 with log end offset 528026
  2014-10-16 10:20:18,144 WARN kafka.controller.OfflinePartitionLeaderSelector: [OfflinePartitionLeaderSelector]: No broker in ISR is alive for [TOPIC,6]. Elect leader 1 from live brokers 1,2. There's potential data loss.
  2014-10-16 10:20:18,277 WARN kafka.cluster.Partition: Partition [TOPIC,6] on broker 1: No checkpointed highwatermark is found for partition [TOPIC,6]
  2014-10-16 10:20:18,698 INFO kafka.log.Log: Truncating log TOPIC-6 to offset 0.
  2014-10-16 10:21:18,788 INFO kafka.log.OffsetIndex: Deleting index /storage/kafka/00/kafka_data/TOPIC-6/00000000000000528024.index.deleted
  2014-10-16 10:21:18,781 INFO kafka.log.Log: Deleting segment 528024 from log TOPIC-6.

I'm not too worried about this since I'm hoping to move to Kafka 0.8.2 ASAP, but I was curious whether anyone could explain this behavior.

-Bryan
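
P.S. For anyone hitting the same thing: my understanding is that KAFKA-1028 is the ticket that adds a switch to prefer consistency over availability, so once we're on 0.8.2 the plan is to turn unclean leader election off. Roughly (not yet verified on our side), the broker-level setting would look like:

  # server.properties -- disable unclean leader election (0.8.2+, default is true)
  unclean.leader.election.enable=false

The trade-off being that a partition stays offline until a broker from the ISR comes back, instead of electing an out-of-sync replica and losing data.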