Hi, you are hitting this issue: https://issues.apache.org/jira/browse/KAFKA-4477
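
To check which brokers show the same symptom, a rough sketch like the one below could scan a broker log for ISR shrinks where the ISR collapses to the reporting broker alone. It assumes the log line format shown in the quoted message ("Shrinking ISR for partition [topic,partition] from ... to ..."), which may differ in other Kafka versions. The function name `find_self_only_shrinks` and the regular expression are my own, not part of Kafka.

```python
import re
from collections import defaultdict

# Assumed format, taken from the log lines quoted in this thread, e.g.:
# "... INFO Partition [aki_live,38] on broker 1: Shrinking ISR for
#  partition [aki_live,38] from 5,2,1 to 1 (kafka.cluster.Partition)"
# Each log entry is expected on a single line.
SHRINK_RE = re.compile(
    r"on broker (?P<broker>\d+): Shrinking ISR for partition "
    r"\[(?P<topic>[^,\]]+),(?P<partition>\d+)\] "
    r"from (?P<old>[\d,]+) to (?P<new>[\d,]+)"
)


def find_self_only_shrinks(lines):
    """Return {broker_id: [(topic, partition), ...]} for every shrink
    in which the new ISR contains only the reporting broker itself."""
    hits = defaultdict(list)
    for line in lines:
        match = SHRINK_RE.search(line)
        if match is None:
            continue  # not an ISR shrink line
        broker = match.group("broker")
        new_isr = match.group("new").split(",")
        if new_isr == [broker]:
            hits[int(broker)].append(
                (match.group("topic"), int(match.group("partition")))
            )
    return dict(hits)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python isr_check.py /path/to/server.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for broker_id, partitions in find_self_only_shrinks(log).items():
            print(f"broker {broker_id}: {len(partitions)} partitions shrank to itself")
```

A broker reporting many such shrinks within a few milliseconds, as broker 1 does in the quoted log, matches the pattern described in the message.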

On Wed, Dec 28, 2016 at 3:43 PM, Alessandro De Maria <alessandro.dema...@gmail.com> wrote:

> Hello,
>
> I would like to get some help/advice on some issues I am having with my
> kafka cluster.
>
> I am running kafka (kafka_2.11-0.10.1.0) on a 5 broker cluster (ubuntu
> 16.04)
>
> configuration is here: http://pastebin.com/cPch8Kd7
>
> today one of the 5 brokers (id: 1) appeared to disconnect from the others:
>
> The log shows this around that time
> [2016-12-28 16:18:30,575] INFO Partition [aki_reload5yl_5,11] on broker 1: Shrinking ISR for partition [aki_reload5yl_5,11] from 2,3,1 to 1 (kafka.cluster.Partition)
> [2016-12-28 16:18:30,579] INFO Partition [ale_reload5yl_1,0] on broker 1: Shrinking ISR for partition [ale_reload5yl_1,0] from 5,1,2 to 1 (kafka.cluster.Partition)
> [2016-12-28 16:18:30,580] INFO Partition [hl7_staging,17] on broker 1: Shrinking ISR for partition [hl7_staging,17] from 4,1,5 to 1 (kafka.cluster.Partition)
> [2016-12-28 16:18:30,581] INFO Partition [hes_reload_5,37] on broker 1: Shrinking ISR for partition [hes_reload_5,37] from 1,2,5 to 1 (kafka.cluster.Partition)
> [2016-12-28 16:18:30,582] INFO Partition [aki_live,38] on broker 1: Shrinking ISR for partition [aki_live,38] from 5,2,1 to 1 (kafka.cluster.Partition)
> [2016-12-28 16:18:30,582] INFO Partition [hl7_live,51] on broker 1: Shrinking ISR for partition [hl7_live,51] from 1,3,4 to 1 (kafka.cluster.Partition)
>
> (other hosts had)
> java.io.IOException: Connection to 1 was disconnected before the response was read
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
>     at scala.Option.foreach(Option.scala:257)
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
>     at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
>     at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
>     at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
>     at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
>     at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
>     at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
>     at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
>     at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
>
> while this was happening, the ConsumerOffsetChecker was reporting only a few
> of the 128 partitions configured for some of the topics, and consumers
> started crashing.
>
> I then used KafkaManager to reassign partitions from broker 1 to other
> brokers.
>
> I could then see on the kafka1 log the following errors
> [2016-12-28 17:23:51,816] ERROR [ReplicaFetcherThread-0-4], Error for partition [aki_live,86] to broker 4:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
> [2016-12-28 17:23:51,817] ERROR [ReplicaFetcherThread-0-4], Error for partition [aki_live,21] to broker 4:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
> [2016-12-28 17:23:51,817] ERROR [ReplicaFetcherThread-0-4], Error for partition [aki_live,126] to broker 4:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
> [2016-12-28 17:23:51,818] ERROR [ReplicaFetcherThread-0-4], Error for partition [aki_live,6] to broker 4:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
>
> I thought I would restart broker1, but as soon as I did, most of my topics
> ended up with some empty partitions, and their consumer offsets were wiped
> out completely.
>
> I understand that because of unclean.leader.election.enable = true an
> unclean leader would be elected, but why were the partitions wiped out if
> there were at least 3 replicas for each?
>
> What do you think caused the disconnection in the first place, and how can I
> recover from situations like this in the future?
>
> Regards
> Alessandro
>
> --
> Alessandro De Maria
> alessandro.dema...@gmail.com