We run a 5-node Kafka cluster in production with a replication factor of 3. When we lost a broker for a couple of days and its kafka-data directory was wiped before it came back online, we had to do a rolling restart of all brokers to get them healthy again.
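For reference, this is roughly how we check which partitions are still under-replicated, besides watching the JMX metric. It is a minimal sketch using the Java AdminClient, which comes from newer kafka-clients releases than the 0.8.x-era brokers in the log below; on older versions, kafka-topics.sh --describe --under-replicated-partitions against ZooKeeper gives a similar view. The bootstrap address and class name are just placeholders:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; substitute one of your brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker-host1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            Set<String> topics = admin.listTopics().names().get();
            Map<String, TopicDescription> descs = admin.describeTopics(topics).all().get();
            for (TopicDescription d : descs.values()) {
                for (TopicPartitionInfo p : d.partitions()) {
                    // A partition is under-replicated when its ISR is smaller than its replica set.
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("%s-%d leader=%s isr=%s replicas=%s%n",
                                d.name(), p.partition(), p.leader(), p.isr(), p.replicas());
                    }
                }
            }
        }
    }
}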
It recovers itself for the most part: FailedFetchRequests and UnderReplicatedPartitions decrease slowly after the failed broker comes back online. But after some time UnderReplicatedPartitions stays flat for 2 brokers and never drops to zero. When I checked the broker logs, I saw this exception:

2015-07-29 02:15:57,289 [kafka-request-handler-5] ERROR (kafka.server.ReplicaManager) - [Replica Manager on Broker 4]: Error when processing fetch request for partition [com.salesforce.mandm.ajna.Metric.puppet.system,7] offset 5627 from follower with correlation id 2425050. Possible cause: Request for offset 5627 but we only have log segments in the range 5808 to 5985.
2015-07-29 02:15:57,289 [kafka-network-thread-6667-3] ERROR (kafka.network.Processor) - Closing socket for kafka-broker-host1 because of error
kafka.common.KafkaException: This operation cannot be completed on a complete request.

kafka-broker-host1 is the failed broker that came back online.

Is this a bug or expected behavior? Are we supposed to always do a rolling restart if the kafka-data directory on one broker is wiped? BTW, we did not see any impact to producers or consumers; we only lost some replication until the rolling restart was done.

--
Thanks,
Raja.