Hi,

We have the following setup:

  Number of brokers: 3
  Number of zookeepers: 3
  Default replication factor: 3
  Offsets storage: kafka
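In case the exact settings matter, the above maps to roughly the following properties (assuming the usual property names; our actual files may contain other overrides as well):

  # broker server.properties
  default.replication.factor=3

  # high-level consumer config
  offsets.storage=kafka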
When one of our brokers ran out of disk space, we started seeing a lot of errors in the broker logs at an alarming rate. This caused the other brokers to run out of disk space as well:

  ERROR [ReplicaFetcherThread-0-101813211], Error for partition [xxxx,47] to broker 101813211:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)

  WARN [Replica Manager on Broker 101813211]: Fetch request with correlation id 161672 from client ReplicaFetcherThread-0-101813211 on partition [xxxx,11] failed due to Leader not local for partition [xxxx,11] on broker 101813211 (kafka.server.ReplicaManager)

We also noticed NotLeaderForPartitionException in the producer and consumer logs (also at an alarming rate):

  ERROR [2015-05-27 09:54:48,613] kafka.consumer.ConsumerFetcherThread: [ConsumerFetcherThread-xxxx_prod2-1432719772385-bd7608b8-0-101813211], Error for partition [yyyy,1] to broker 101813211:class kafka.common.NotLeaderForPartitionException

The __consumer_offsets topic somehow got corrupted, and on restart consumers started re-consuming messages they had already consumed. We deleted the offending topic and tried restarting the brokers and zookeepers. Now we are getting lots of corrupt index errors on broker startup.

Was all of this due to the replication factor being the same as the number of brokers? Why would the topic files get corrupted in such a scenario? Please let us know how to recover from this situation. Also, how do we turn down the error logging rate?

Thanks,
Jananee
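P.S. Regarding the logging rate: would raising the level on the noisiest loggers in the brokers' log4j.properties be a reasonable way to do that, for example something like the lines below (logger names taken from the messages quoted above), or is there a recommended approach?

  # hypothetical change: raise the threshold for the two loggers flooding our logs
  log4j.logger.kafka.server.ReplicaManager=ERROR
  log4j.logger.kafka.server.ReplicaFetcherThread=FATAL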