Hi Ismael and Jan,

Thanks a lot for your prompt responses!
> Is inter.broker.protocol.version set correctly in brokers 1 and 2? It
> should be 0.10.0 so that they can talk to the older broker without issue.

I set it on broker #2, but it doesn't seem to work. I've put a sketch of
what I set in the P.S. at the bottom of this mail, in case I got it wrong.

> The only option I know of is to reboot the affected broker. And upgrade
> to 0.10.1.1 as quickly as possible. We haven't seen this issue on
> 0.10.1.1.RC0.

I'm using https://hub.docker.com/r/wurstmeister/kafka/tags/ and there's no
such version there. The Kafka website doesn't offer it for download either.
Is 0.10.1.1 not considered stable yet? I'm not sure about using it... Maybe
a downgrade would work?

Re: restarting the faulty broker. As I understand it, to avoid losing data
I'd have to shut down other parts of the cluster first, right? (See the
P.P.S. below for the checks I'm planning to run before touching broker
1001.)

-Valentin

On Thu, Dec 22, 2016 at 9:01 PM, Jan Omar <janamory.o...@gmail.com> wrote:

> Unfortunately I think you hit this bug:
>
> https://issues.apache.org/jira/browse/KAFKA-4477
>
> The only option I know of is to reboot the affected broker. And upgrade
> to 0.10.1.1 as quickly as possible. We haven't seen this issue on
> 0.10.1.1.RC0.
>
> Regards
>
> Jan
>
>
> > On 22 Dec 2016, at 18:16, Ismael Juma <ism...@juma.me.uk> wrote:
> >
> > Hi Valentin,
> >
> > Is inter.broker.protocol.version set correctly in brokers 1 and 2? It
> > should be 0.10.0 so that they can talk to the older broker without
> > issue.
> >
> > Ismael
> >
> > On Thu, Dec 22, 2016 at 4:42 PM, Valentin Golev
> > <valentin.go...@gdeslon.ru> wrote:
> >
> >> Hello,
> >>
> >> I have a three broker Kafka setup (the ids are 1, 2 (kafka 0.10.1.0)
> >> and 1001 (kafka 0.10.0.0)). After a failure of two of them, a lot of
> >> the partitions have the third one (1001) as their leader. It's like
> >> this:
> >>
> >> Topic: userevents0.open  Partition: 5   Leader: 1     Replicas: 1,2,1001  Isr: 1,1001,2
> >> Topic: userevents0.open  Partition: 6   Leader: 2     Replicas: 2,1,1001  Isr: 1,2,1001
> >> Topic: userevents0.open  Partition: 7   Leader: 1001  Replicas: 1001,2,1  Isr: 1001
> >> Topic: userevents0.open  Partition: 8   Leader: 1     Replicas: 1,1001,2  Isr: 1,1001,2
> >> Topic: userevents0.open  Partition: 9   Leader: 1001  Replicas: 2,1001,1  Isr: 1001
> >> Topic: userevents0.open  Partition: 10  Leader: 1001  Replicas: 1001,1,2  Isr: 1001
> >>
> >> As you can see, only the partitions with leader 1 or 2 have
> >> successfully replicated. Brokers 1 and 2, however, are unable to fetch
> >> data from 1001.
> >>
> >> All of the partitions are available to the consumers and producers. So
> >> everything is fine except replication. 1001 is reachable from the
> >> other servers.
> >>
> >> I can't restart broker 1001 because it seems that it would cause data
> >> loss (as you can see, it's the only ISR on many partitions).
> >> Restarting the other brokers didn't help at all. Neither did just
> >> plain waiting (it's the third day of this going on). So what do I do?
> >>
> >> The logs of broker 2 (the one which tries to fetch data) are full of
> >> this:
> >>
> >> [2016-12-22 16:38:52,199] WARN [ReplicaFetcherThread-0-1001], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@117a49bf (kafka.server.ReplicaFetcherThread)
> >> java.io.IOException: Connection to 1001 was disconnected before the response was read
> >>         at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
> >>         at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
> >>         at scala.Option.foreach(Option.scala:257)
> >>         at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
> >>         at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
> >>         at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
> >>         at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
> >>         at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
> >>         at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
> >>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
> >>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
> >>         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
> >>         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
> >>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
> >>
> >> The logs of broker 1001 are full of this:
> >>
> >> [2016-12-22 16:38:54,226] ERROR Processor got uncaught exception. (kafka.network.Processor)
> >> java.nio.BufferUnderflowException
> >>         at java.nio.Buffer.nextGetIndex(Buffer.java:506)
> >>         at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:361)
> >>         at kafka.api.FetchRequest$$anonfun$1$$anonfun$apply$1.apply(FetchRequest.scala:55)
> >>         at kafka.api.FetchRequest$$anonfun$1$$anonfun$apply$1.apply(FetchRequest.scala:52)
> >>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> >>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> >>         at scala.collection.immutable.Range.foreach(Range.scala:160)
> >>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> >>         at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> >>         at kafka.api.FetchRequest$$anonfun$1.apply(FetchRequest.scala:52)
> >>         at kafka.api.FetchRequest$$anonfun$1.apply(FetchRequest.scala:49)
> >>         at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> >>         at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> >>         at scala.collection.immutable.Range.foreach(Range.scala:160)
> >>         at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> >>         at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> >>         at kafka.api.FetchRequest$.readFrom(FetchRequest.scala:49)
> >>         at kafka.network.RequestChannel$Request$$anonfun$2.apply(RequestChannel.scala:65)
> >>         at kafka.network.RequestChannel$Request$$anonfun$2.apply(RequestChannel.scala:65)
> >>         at kafka.network.RequestChannel$Request$$anonfun$4.apply(RequestChannel.scala:71)
> >>         at kafka.network.RequestChannel$Request$$anonfun$4.apply(RequestChannel.scala:71)
> >>         at scala.Option.map(Option.scala:146)
> >>         at kafka.network.RequestChannel$Request.<init>(RequestChannel.scala:71)
> >>         at kafka.network.Processor$$anonfun$processCompletedReceives$1.apply(SocketServer.scala:488)
> >>         at kafka.network.Processor$$anonfun$processCompletedReceives$1.apply(SocketServer.scala:483)
> >>         at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> >>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> >>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> >>         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> >>         at kafka.network.Processor.processCompletedReceives(SocketServer.scala:483)
> >>         at kafka.network.Processor.run(SocketServer.scala:413)
> >>         at java.lang.Thread.run(Thread.java:745)
> >
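P.S. Ismael, in case I applied your suggestion wrong: since I'm on the
wurstmeister image, I set the property through the container environment.
As far as I understand, the image turns every KAFKA_* variable into the
matching server.properties entry, so on broker #2 I have roughly this
(a sketch, not a verbatim copy of my compose file):

    environment:
      KAFKA_BROKER_ID: 2
      # should become inter.broker.protocol.version=0.10.0
      KAFKA_INTER_BROKER_PROTOCOL_VERSION: "0.10.0"
      # the upgrade notes also mention pinning the message format while
      # brokers run mixed versions; not sure if it matters for this issue
      KAFKA_LOG_MESSAGE_FORMAT_VERSION: "0.10.0"

If that's not the right way to set it, please tell me. And do I need the
same setting on broker #1 as well, or is #2 enough?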
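P.P.S. Jan, re: rebooting 1001 - before I touch it I'd like to see exactly
which partitions have it as their only in-sync replica. I'm planning to
check with something like this (a sketch; "zk:2181" stands in for our
ZooKeeper address):

    # partitions whose ISR is smaller than the replica set,
    # i.e. where replication is still broken
    bin/kafka-topics.sh --zookeeper zk:2181 --describe \
        --under-replicated-partitions

    # partitions where 1001 is the only in-sync replica - the ones that
    # would go offline (or lose data) if 1001 went down right now
    bin/kafka-topics.sh --zookeeper zk:2181 --describe | grep 'Isr: 1001$'

Does that look right?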