Sure, I will try to take a snapshot of the data distribution when it
happens next time.
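
For the snapshot, I was planning to dump the per-partition log end offsets
with GetOffsetShell, roughly like this (the broker host below is just a
placeholder for one of ours):

$KAFKA_HOME/bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list broker1:9092 --topic myKafkaTopic --time -1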

Assuming topic.metadata.refresh.interval.ms is the concern, how should
we "unstick" our producers?

The important note in the documentation seems to suggest that the
metadata refresh will only happen AFTER a message is sent:

The producer generally refreshes the topic metadata from brokers when there
is a failure (partition missing, leader not available, ...). It will also
poll regularly (default: every 10 min, i.e. 600000 ms). If you set this to a
negative value, metadata will only get refreshed on failure. If you set
this to zero, the metadata will get refreshed after each message sent (not
recommended). Important note: the refresh happens only AFTER a message is
sent, so if the producer never sends a message the metadata is never
refreshed.
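
For what it's worth, here is roughly what I was thinking of trying on our
side. It is just a minimal sketch of the old producer with the refresh
interval lowered; the broker list, serializer, and the 60-second value are
placeholders rather than our real settings:

import java.util.Properties;
import java.util.UUID;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class MetadataRefreshSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // lower the refresh interval from the 10 min default so the producer
        // re-fetches topic metadata (and re-picks partitions) more often
        props.put("topic.metadata.refresh.interval.ms", "60000");

        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));

        // same random-key strategy we already use
        String uuid = UUID.randomUUID().toString();
        producer.send(new KeyedMessage<String, String>(
                "myKafkaTopic", uuid, "test message"));
        producer.close();
    }
}

That said, given the note above that the refresh only happens after a send,
I'm not sure lowering the interval is enough on its own if a producer goes
idle.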

Thanks,
Connie




On Fri, Jul 18, 2014 at 3:58 PM, Neha Narkhede <neha.narkh...@gmail.com>
wrote:

> Does this mean that we should set "auto.leader.rebalance.enable" to true?
>
> I wouldn't recommend that just yet since it is not known to be very stable.
> You mentioned that only 2 brokers ever took the traffic and the replication
> factor is 2, which makes me think that the producer stuck to one or a few
> partitions instead of distributing the data over all the partitions. This
> is a known problem in the old producer, where the default value of the
> config (topic.metadata.refresh.interval.ms) that controls how long a
> producer sticks to certain partitions is 10 minutes. So it effectively does
> not distribute data evenly across all partitions.
>
> If you see the same behavior next time, try to take a snapshot of data
> distribution across all partitions to verify this theory.
>
> Thanks,
> Neha
>
>
> On Thu, Jul 17, 2014 at 5:43 PM, Connie Yang <cybercon...@gmail.com>
> wrote:
>
> > It might appear that the data is not balanced, but it could be a result
> > of the leaders being imbalanced.
> >
> > Does this mean that we should set "auto.leader.rebalance.enable" to true?
> > Is there any other configuration we need to change as well? As I
> > mentioned before, we pretty much use the default settings.
> >
> > All of our topics have a replication factor of 2 (i.e., 2 copies per
> > message).
> >
> > We don't have the topic output from when we had the problem, but here is
> > our topic output after we ran the kafka-preferred-replica-election.sh
> > tool as suggested:
> >
> > $KAFKA_HOME/bin/kafka-topics.sh   --zookeeper
> > zkHost1:2181,zkHost2:2181,zkHost3:2181 --describe --topic=myKafkaTopic
> > Topic:myKafkaTopic PartitionCount:24 ReplicationFactor:2 Configs:
> > retention.ms=43200000
> > Topic: myKafkaTopic Partition: 0 Leader: 2 Replicas: 2,1 Isr: 1,2
> > Topic: myKafkaTopic Partition: 1 Leader: 3 Replicas: 3,2 Isr: 3,2
> > Topic: myKafkaTopic Partition: 2 Leader: 4 Replicas: 4,3 Isr: 3,4
> > Topic: myKafkaTopic Partition: 3 Leader: 5 Replicas: 5,4 Isr: 5,4
> > Topic: myKafkaTopic Partition: 4 Leader: 6 Replicas: 6,5 Isr: 5,6
> > Topic: myKafkaTopic Partition: 5 Leader: 7 Replicas: 7,6 Isr: 6,7
> > Topic: myKafkaTopic Partition: 6 Leader: 8 Replicas: 8,7 Isr: 7,8
> > Topic: myKafkaTopic Partition: 7 Leader: 9 Replicas: 9,8 Isr: 9,8
> > Topic: myKafkaTopic Partition: 8 Leader: 10 Replicas: 10,9 Isr: 10,9
> > Topic: myKafkaTopic Partition: 9 Leader: 11 Replicas: 11,10 Isr: 11,10
> > Topic: myKafkaTopic Partition: 10 Leader: 12 Replicas: 12,11 Isr: 11,12
> > Topic: myKafkaTopic Partition: 11 Leader: 13 Replicas: 13,12 Isr: 12,13
> > Topic: myKafkaTopic Partition: 12 Leader: 14 Replicas: 14,13 Isr: 14,13
> > Topic: myKafkaTopic Partition: 13 Leader: 15 Replicas: 15,14 Isr: 14,15
> > Topic: myKafkaTopic Partition: 14 Leader: 16 Replicas: 16,15 Isr: 16,15
> > Topic: myKafkaTopic Partition: 15 Leader: 17 Replicas: 17,16 Isr: 16,17
> > Topic: myKafkaTopic Partition: 16 Leader: 18 Replicas: 18,17 Isr: 18,17
> > Topic: myKafkaTopic Partition: 17 Leader: 19 Replicas: 19,18 Isr: 18,19
> > Topic: myKafkaTopic Partition: 18 Leader: 20 Replicas: 20,19 Isr: 20,19
> > Topic: myKafkaTopic Partition: 19 Leader: 21 Replicas: 21,20 Isr: 20,21
> > Topic: myKafkaTopic Partition: 20 Leader: 22 Replicas: 22,21 Isr: 22,21
> > Topic: myKafkaTopic Partition: 21 Leader: 23 Replicas: 23,22 Isr: 23,22
> > Topic: myKafkaTopic Partition: 22 Leader: 24 Replicas: 24,23 Isr: 23,24
> > Topic: myKafkaTopic Partition: 23 Leader: 1 Replicas: 1,24 Isr: 1,24
> >
> > Thanks,
> > Connie
> >
> >
> >
> > On Thu, Jul 17, 2014 at 4:20 PM, Neha Narkhede <neha.narkh...@gmail.com>
> > wrote:
> >
> > > Connie,
> > >
> > > After we freed up the cluster disk space and adjusted the broker data
> > > retention policy, we noticed that the cluster partitions were not
> > > balanced based on the output of the topic describe script that came
> > > with the Kafka 0.8.1.1 distribution.
> > >
> > > When you say the cluster was not balanced, did you mean the leaders or
> > > the data? The describe topic tool does not give information about data
> > > sizes, so I'm assuming you are referring to leader imbalance. If so,
> > > the right tool to run is kafka-preferred-replica-election.sh, not
> > > partition reassignment. In general, assuming the partitions were evenly
> > > distributed on your cluster before you ran out of disk space, the only
> > > thing you should need to do to recover is delete a few older segments
> > > and bounce each broker, one at a time. It is also preferable to run
> > > preferred replica election after a complete cluster bounce so the
> > > leaders are well distributed.
> > >
> > > Also, it will help if you can send around the output of the describe
> > > topic tool. I wonder if your topics have a replication factor of 1
> > > inadvertently?
> > >
> > > Thanks,
> > > Neha
> > >
> > >
> > > On Thu, Jul 17, 2014 at 11:57 AM, Connie Yang <cybercon...@gmail.com>
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > Our Kafka cluster ran out of disk space yesterday. After we freed up
> > > > the cluster disk space and adjusted the broker data retention policy,
> > > > we noticed that the cluster partitions were not balanced based on the
> > > > output of the topic describe script that came with the Kafka 0.8.1.1
> > > > distribution. So, we tried to rebalance the partitions using
> > > > kafka-reassign-partitions.sh. Some time later, we ran out of disk
> > > > space on 2 brokers in the cluster while the rest had plenty of disk
> > > > space left.
> > > >
> > > > This seems to suggest that only two brokers were receiving messages.
> > > > We have not changed the partitioning on the producer side, which uses
> > > > a random partition key strategy:
> > > >
> > > > String uuid = UUID.randomUUID().toString();
> > > > KeyedMessage<String, String> data = new KeyedMessage<String, String>(
> > > >         "myKafkaTopic", uuid, msgBuilder.toString());
> > > >
> > > >
> > > > Questions:
> > > > 1. Is partition reassignment required after a disk fills up or when
> > > > some of the brokers are not healthy?
> > > > 2. Is there a broker config that we can use to auto-rebalance the
> > > > partitions? Should "auto.leader.rebalance.enable" be set to true?
> > > > 3. How do we recover from a situation like this?
> > > >
> > > > We pretty much use the default configuration on the brokers.
> > > >
> > > > Thanks,
> > > > Connie
> > > >
> > >
> >
>
