So I am using 0.8.0. I think I actually found the issue. It turns out that
some partitions had only a single replica, and the leaders of those
partitions would basically "refuse" new writes. As soon as I reassigned
replicas to those partitions, writes started flowing again. Not sure if
that's expected... but it seemed to make the problem go away.
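
For anyone hitting the same thing, the reassignment is along these lines
(topic name, broker ids and zookeeper address below are placeholders; the
flags shown are the 0.8.1 ones, the 0.8.0 script takes slightly different
options, so check its --help first):

  $ cat increase-replication.json
  {"version": 1,
   "partitions": [
     {"topic": "my_topic", "partition": 0, "replicas": [3, 4, 7]}
   ]}

  $ bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
      --reassignment-json-file increase-replication.json --execute

  $ bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
      --reassignment-json-file increase-replication.json --verify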

Thanks,

On Wed, Oct 15, 2014 at 6:46 AM, Neha Narkhede <neha.narkh...@gmail.com>
wrote:

> Which version of Kafka are you using? The current stable one is 0.8.1.1
>
> On Tue, Oct 14, 2014 at 5:51 PM, Jean-Pascal Billaud <j...@tellapart.com>
> wrote:
>
> > Hey Neha,
> >
> > so I removed another broker about 30 minutes ago and since then the
> > producer has basically been dying with:
> >
> > Event queue is full of unsent messages, could not send event:
> > KeyedMessage(my_topic,[B@1b71b7a6,[B@35fdd1e7)
> > kafka.common.QueueFullException: Event queue is full of unsent messages,
> > could not send event: KeyedMessage(my_topic,[B@1b71b7a6,[B@35fdd1e7)
> > at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Unknown Source)
> > ~[kafka_2.10-0.8.0.jar:0.8.0]
> > at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Unknown Source)
> > ~[kafka_2.10-0.8.0.jar:0.8.0]
> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> > ~[scala-library-2.10.3.jar:na]
> > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> > ~[scala-library-2.10.3.jar:na]
> > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> > ~[scala-library-2.10.3.jar:na]
> > at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> > ~[scala-library-2.10.3.jar:na]
> > at kafka.producer.Producer.asyncSend(Unknown Source)
> > ~[kafka_2.10-0.8.0.jar:0.8.0]
> > at kafka.producer.Producer.send(Unknown Source)
> > ~[kafka_2.10-0.8.0.jar:0.8.0]
> > at kafka.javaapi.producer.Producer.send(Unknown Source)
> > ~[kafka_2.10-0.8.0.jar:0.8.0]
> >
> > It seems like it cannot recover for some reason. The new leaders appear to
> > have been elected, so the producer should have picked up the new metadata
> > for those partitions. Is this a known issue with 0.8.0? What should I be
> > looking for to debug/fix this?
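> >
> > For what it's worth, the knobs on the old async producer that govern this
> > queue are along these lines (property names are from the 0.8.x producer
> > config docs; the values shown are just illustrative):
> >
> >   producer.type=async
> >   # max events buffered before sends start failing or blocking
> >   queue.buffering.max.messages=10000
> >   # 0 = throw QueueFullException right away when full, -1 = block forever
> >   queue.enqueue.timeout.ms=0
> >   # how often partition/leader metadata is refreshed proactively
> >   # (it is also refreshed after send failures)
> >   topic.metadata.refresh.interval.ms=600000
> >   message.send.max.retries=3
> >   retry.backoff.ms=100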
> >
> > Thanks,
> >
> > On Tue, Oct 14, 2014 at 2:22 PM, Neha Narkhede <neha.narkh...@gmail.com>
> > wrote:
> >
> > > Regarding (1), I am assuming that it is expected that brokers going down
> > > will be brought back up soon. At which point, they will pick up from the
> > > current leader and get back into the ISR. Am I right?
> > >
> > > The broker will be added back to the ISR once it is restarted, but it
> > > never goes out of the replica list until the admin explicitly moves it
> > > using the reassign partitions tool.
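> > >
> > > (You can see that split in zookeeper itself: the assignment and the
> > > leader/ISR live in separate znodes, roughly like this for a topic named
> > > my_topic:
> > >
> > >   get /brokers/topics/my_topic
> > >   {"version":1,"partitions":{"0":[5,3,4], ...}}
> > >
> > >   get /brokers/topics/my_topic/partitions/0/state
> > >   {"controller_epoch":..,"leader":3,"leader_epoch":..,"version":1,"isr":[3,4]}
> > >
> > > Only the second one shrinks when a broker dies; the first stays put until
> > > a reassignment.)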
> > >
> > > Regarding (2), I finally kicked off a reassign_partitions admin task
> > > adding broker 7 to the replicas list for partition 0 which finally fixed
> > > the under replicated issue:
> > > Is this therefore expected that the user will fix up the under
> > > replication situation?
> > >
> > > Yes. Currently, partition reassignment is purely an admin initiated task.
> > >
> > > Another thing I'd like to clarify is that for another topic Y, broker 5
> > > was never removed from the ISR array. Note that Y is an unused topic so
> > > I am guessing that technically broker 5 is not out of sync... though it
> > > is still dead. Is this the expected behavior?
> > >
> > > Not really. After replica.lag.time.max.ms (which defaults to 10
> > > seconds), the leader should remove the dead broker out of the ISR.
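> > >
> > > (The relevant broker-side settings in server.properties are roughly the
> > > following, shown with their 0.8.x-era defaults:
> > >
> > >   # follower is dropped from the ISR if it hasn't fetched for this long
> > >   replica.lag.time.max.ms=10000
> > >   # 0.8.x also drops a follower that falls this many messages behind
> > >   replica.lag.max.messages=4000
> > >
> > > so a dead broker should normally fall out of the ISR within seconds.)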
> > >
> > > Thanks,
> > > Neha
> > >
> > > On Tue, Oct 14, 2014 at 9:27 AM, Jean-Pascal Billaud <j...@tellapart.com>
> > > wrote:
> > >
> > > > hey folks,
> > > >
> > > > I have been testing a kafka cluster of 10 nodes on AWS using version
> > > > 2.8.0-0.8.0 and see some behavior on failover that I want to make sure
> > > > I understand.
> > > >
> > > > Initially, I have a topic X with 30 partitions and a replication
> > > > factor of 3. Looking at partition 0:
> > > > partition: 0 - leader: 5  preferred leader: 5  brokers: [5, 3, 4]
> > > > in-sync: [5, 3, 4]
> > > >
> > > > After killing broker 5, the controller immediately grabs the next
> > > > replica in the ISR and assigns it as the leader:
> > > > partition: 0 - leader: 3  preferred leader: 5  brokers: [5, 3, 4]
> > > > in-sync: [3, 4]
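> > > >
> > > > (For comparison, the stock topic tool reports roughly the same thing;
> > > > on 0.8.1 that is something like the below, with a placeholder zookeeper
> > > > address:
> > > >
> > > >   bin/kafka-topics.sh --describe --zookeeper zk1:2181 --topic X
> > > >   Topic: X  Partition: 0  Leader: 3  Replicas: 5,3,4  Isr: 3,4
> > > >
> > > > while on 0.8.0 I believe the equivalent script is kafka-list-topic.sh.)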
> > > >
> > > > There are a couple of things at this point I would like to clarify:
> > > >
> > > > (1) Why is broker 5 still in the brokers array for partition 0? Note
> > > > this broker array comes from a get of the zookeeper path
> > > > /brokers/topics/[topic] as documented.
> > > > (2) Partition 0 is now under replicated and the controller does not
> > > > seem to do anything about it. Is this expected?
> > > >
> > > > Regarding (1), I am assuming that it is expected that brokers going
> > > > down will be brought back up soon. At which point, they will pick up
> > > > from the current leader and get back into the ISR. Am I right?
> > > >
> > > > Regarding (2), I finally kicked off a reassign_partitions admin task
> > > > adding broker 7 to the replicas list for partition 0 which finally
> > > > fixed the under replicated issue:
> > > >
> > > > partition: 0 - leader: 3  expected_leader: 3  brokers: [3, 4, 7]
> > > > in-sync: [3, 4, 7]
> > > >
> > > > Is it therefore expected that the user will fix up the under
> > > > replication situation? Or maybe it is expected that broker 5 will come
> > > > back soon and this whole thing is a non-issue once that's true, given
> > > > that decommissioning brokers is not something supported as of the
> > > > kafka version I am using.
> > > >
> > > > Another thing I'd like to clarify is that for another topic Y, broker
> > > > 5 was never removed from the ISR array. Note that Y is an unused topic
> > > > so I am guessing that technically broker 5 is not out of sync... though
> > > > it is still dead. Is this the expected behavior?
> > > >
> > > > I'd really appreciate it if somebody could confirm my understanding,
> > > >
> > > > Thanks,
> > > >
> > >
> >
>
