Re: How to recover from a disk full situation in Kafka cluster?

Neha Narkhede Fri, 18 Jul 2014 15:59:27 -0700

> Does this mean that we should set "auto.leader.rebalance.enable" to true?


I wouldn't recommend that just yet, since it is not known to be very stable.
You mentioned that only 2 brokers ever took the traffic and that the
replication factor is 2, which makes me think the producer stuck to one or a
few partitions instead of distributing the data over all of them. This is a
known problem in the old producer: the default value of the config
topic.metadata.refresh.interval.ms, which controls how long a producer sticks
to a given partition, is 10 minutes. So it effectively does not distribute
data evenly across all partitions.
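
As an illustration of the config mentioned above, here is a minimal sketch of
old-producer settings with a shorter refresh interval. The broker addresses and
the 60-second value are assumptions made up for the example, not values from
this thread, and a shorter interval means more frequent metadata requests.

```java
import java.util.Properties;

public class OldProducerConfigSketch {
    // Builds settings for the old (0.8) producer. The caller supplies the
    // broker list and refresh interval; both are illustrative here.
    static Properties buildProps(String brokerList, long refreshMs) {
        Properties props = new Properties();
        props.put("metadata.broker.list", brokerList);
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // The default is 600000 ms (10 minutes). A lower value makes the
        // producer re-pick a partition more often.
        props.put("topic.metadata.refresh.interval.ms", Long.toString(refreshMs));
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical broker addresses and a 60-second refresh interval.
        Properties props = buildProps("broker1:9092,broker2:9092", 60_000L);
        System.out.println(props.getProperty("topic.metadata.refresh.interval.ms"));
    }
}
```

These properties would then be passed to the producer's config object when the
producer is created.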

If you see the same behavior next time, try to take a snapshot of data
distribution across all partitions to verify this theory.
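
One possible way to take such a snapshot is sketched below. It assumes the
usual broker layout in which each partition lives in a directory named
<topic>-<partition> under the broker's log directory, and it has to be run on
each broker separately. The class name and the example paths are made up for
illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

public class PartitionSizeSnapshot {
    // Returns the total bytes on disk for each "<topic>-*" directory found
    // directly under logDir. Note that the glob also matches other topics
    // whose names start with "<topic>-".
    static Map<String, Long> snapshot(Path logDir, String topic) throws IOException {
        Map<String, Long> sizes = new TreeMap<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(logDir, topic + "-*")) {
            for (Path dir : dirs) {
                if (!Files.isDirectory(dir)) {
                    continue;
                }
                long total;
                try (Stream<Path> files = Files.walk(dir)) {
                    // Sum the sizes of segment and index files in this partition.
                    total = files.filter(Files::isRegularFile)
                                 .mapToLong(f -> f.toFile().length())
                                 .sum();
                }
                sizes.put(dir.getFileName().toString(), total);
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: PartitionSizeSnapshot <log.dir> <topic>");
            return;
        }
        // Example arguments (assumed): /tmp/kafka-logs myKafkaTopic
        snapshot(Paths.get(args[0]), args[1])
            .forEach((partition, bytes) -> System.out.println(partition + "\t" + bytes));
    }
}
```

Comparing the per-partition byte counts across brokers should show whether a
few partitions hold most of the data.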

Thanks,
Neha


On Thu, Jul 17, 2014 at 5:43 PM, Connie Yang <[email protected]> wrote:

> It might appear that the data is not balanced, but that could be a result
> of the leader imbalance.
>
> Does this mean that we should set "auto.leader.rebalance.enable" to true?
> Are there any other configurations we need to change as well? As I
> mentioned before, we pretty much use the default settings.
>
> All of our topics have replication factor of 2 (aka 2 copies per message).
>
> We don't have the topic output when we had the problem, but here's our
> topic output after we ran the kafka-preferred-replica-election.sh tool as
> suggested:
>
> $KAFKA_HOME/bin/kafka-topics.sh --zookeeper zkHost1:2181,zkHost2:2181,zkHost3:2181 --describe --topic=myKafkaTopic
> Topic:myKafkaTopic PartitionCount:24 ReplicationFactor:2 Configs:retention.ms=43200000
> Topic: myKafkaTopic Partition: 0 Leader: 2 Replicas: 2,1 Isr: 1,2
> Topic: myKafkaTopic Partition: 1 Leader: 3 Replicas: 3,2 Isr: 3,2
> Topic: myKafkaTopic Partition: 2 Leader: 4 Replicas: 4,3 Isr: 3,4
> Topic: myKafkaTopic Partition: 3 Leader: 5 Replicas: 5,4 Isr: 5,4
> Topic: myKafkaTopic Partition: 4 Leader: 6 Replicas: 6,5 Isr: 5,6
> Topic: myKafkaTopic Partition: 5 Leader: 7 Replicas: 7,6 Isr: 6,7
> Topic: myKafkaTopic Partition: 6 Leader: 8 Replicas: 8,7 Isr: 7,8
> Topic: myKafkaTopic Partition: 7 Leader: 9 Replicas: 9,8 Isr: 9,8
> Topic: myKafkaTopic Partition: 8 Leader: 10 Replicas: 10,9 Isr: 10,9
> Topic: myKafkaTopic Partition: 9 Leader: 11 Replicas: 11,10 Isr: 11,10
> Topic: myKafkaTopic Partition: 10 Leader: 12 Replicas: 12,11 Isr: 11,12
> Topic: myKafkaTopic Partition: 11 Leader: 13 Replicas: 13,12 Isr: 12,13
> Topic: myKafkaTopic Partition: 12 Leader: 14 Replicas: 14,13 Isr: 14,13
> Topic: myKafkaTopic Partition: 13 Leader: 15 Replicas: 15,14 Isr: 14,15
> Topic: myKafkaTopic Partition: 14 Leader: 16 Replicas: 16,15 Isr: 16,15
> Topic: myKafkaTopic Partition: 15 Leader: 17 Replicas: 17,16 Isr: 16,17
> Topic: myKafkaTopic Partition: 16 Leader: 18 Replicas: 18,17 Isr: 18,17
> Topic: myKafkaTopic Partition: 17 Leader: 19 Replicas: 19,18 Isr: 18,19
> Topic: myKafkaTopic Partition: 18 Leader: 20 Replicas: 20,19 Isr: 20,19
> Topic: myKafkaTopic Partition: 19 Leader: 21 Replicas: 21,20 Isr: 20,21
> Topic: myKafkaTopic Partition: 20 Leader: 22 Replicas: 22,21 Isr: 22,21
> Topic: myKafkaTopic Partition: 21 Leader: 23 Replicas: 23,22 Isr: 23,22
> Topic: myKafkaTopic Partition: 22 Leader: 24 Replicas: 24,23 Isr: 23,24
> Topic: myKafkaTopic Partition: 23 Leader: 1 Replicas: 1,24 Isr: 1,24
>
> Thanks,
> Connie
>
>
>
> On Thu, Jul 17, 2014 at 4:20 PM, Neha Narkhede <[email protected]>
> wrote:
>
> > Connie,
> >
> > > After we freed up the
> > > cluster disk space and adjusted the broker data retention policy, we
> > > noticed that the cluster partitions were not balanced, based on the topic
> > > describe script that came with the Kafka 0.8.1.1 distribution.
> >
> > When you say the cluster was not balanced, did you mean the leaders or the
> > data? The describe topic tool does not give information about data sizes,
> > so I'm assuming you are referring to leader imbalance. If so, the right
> > tool to run is kafka-preferred-replica-election.sh, not partition
> > reassignment. In general, assuming the partitions were evenly distributed
> > on your cluster before you ran out of disk space, the only thing you should
> > need to do to recover is delete a few older segments and bounce each
> > broker, one at a time. It is also preferable to run preferred replica
> > election after a complete cluster bounce so the leaders are well
> > distributed.
> >
> > Also, it will help if you can send around the output of the describe topic
> > tool. I wonder if your topics have a replication factor of 1 inadvertently?
> >
> > Thanks,
> > Neha
> >
> >
> > On Thu, Jul 17, 2014 at 11:57 AM, Connie Yang <[email protected]>
> > wrote:
> >
> > > Hi All,
> > >
> > > Our Kafka cluster ran out of disk space yesterday. After we freed up the
> > > cluster disk space and adjusted the broker data retention policy, we
> > > noticed that the cluster partitions were not balanced, based on the topic
> > > describe script that came with the Kafka 0.8.1.1 distribution. So we tried
> > > to rebalance the partitions using kafka-reassign-partitions.sh. Some time
> > > later, we ran out of disk space on 2 brokers in the cluster while the rest
> > > had plenty of disk space left.
> > >
> > > This seems to suggest that only two brokers were receiving messages. We
> > > have not changed the broker partitioning from our producer, which uses a
> > > random partition key strategy.
> > >
> > > String uuid = UUID.randomUUID().toString();
> > > KeyedMessage<String, String> data = new KeyedMessage<String, String>(
> > >     "myKafkaTopic", uuid, msgBuilder.toString());
> > >
> > >
> > > Questions
> > > 1. Is partition reassignment required after a disk-full event or when some
> > > of the brokers are not healthy?
> > > 2. Is there a broker config that we can use to auto-rebalance the broker
> > > partitions? Should "auto.leader.rebalance.enable" be set to true?
> > > 3. How do we recover from a situation like this?
> > >
> > > We pretty much use default configuration on the broker.
> > >
> > > Thanks,
> > > Connie
> > >
> >
>
