Hi,

We have the following setup:

Number of brokers: 3
Number of ZooKeepers: 3
Default replication factor: 3
Offsets storage: kafka
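
For reference, this is roughly how the setup maps onto the config files
(only the settings listed above; everything else is left at the defaults,
so treat this as a sketch rather than our full configuration):

# server.properties (brokers) - new topics get 3 replicas by default
default.replication.factor=3

# consumer.properties (high-level consumers) - commit offsets to Kafka
# instead of ZooKeeper
offsets.storage=kafka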

When one of our brokers ran out of disk space, we started seeing a lot of
errors in the broker logs at an alarming rate. This caused the other
brokers to run out of disk space as well.

ERROR [ReplicaFetcherThread-0-101813211], Error for partition [xxxx,47] to
broker 101813211:class kafka.common.UnknownException
(kafka.server.ReplicaFetcherThread)

WARN [Replica Manager on Broker 101813211]: Fetch request with correlation
id 161672 from client ReplicaFetcherThread-0-101813211 on partition
[xxxx,11] failed due to Leader not local for partition [xxxx,11] on broker
101813211 (kafka.server.ReplicaManager)

We also noticed NotLeaderForPartitionException in the producer and consumer
logs, also at an alarming rate:

ERROR [2015-05-27 09:54:48,613] kafka.consumer.ConsumerFetcherThread:
[ConsumerFetcherThread-xxxx_prod2-1432719772385-bd7608b8-0-101813211],
Error for partition [yyyy,1] to broker 101813211:class
kafka.common.NotLeaderForPartitionException

The __consumer_offsets topic somehow got corrupted, and on restart the
consumers started re-consuming messages they had already consumed.

We deleted the offending topic and tried restarting the brokers and
ZooKeepers. Now we are getting many corrupt index errors on broker
startup.

Was all this due to the replication factor being the same as the number of
brokers? Why would the topic files get corrupted in such a scenario?
Please let us know how to recover from this situation. Also, how do we
turn down the error logging rate?
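
On the logging question, would overriding the loggers for the noisy
classes in config/log4j.properties, along these lines, be a reasonable
approach? (The logger names below are just assumed from the class names
in the log lines above; we have not tried this yet.)

# Assumed logger names, taken from the classes emitting the noisy lines;
# raising the threshold to FATAL should suppress the repeated ERROR/WARN
# messages from the fetcher and replica manager
log4j.logger.kafka.server.ReplicaFetcherThread=FATAL
log4j.logger.kafka.server.ReplicaManager=FATAL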

Thanks,
Jananee
