hi,

you are hitting this issue:
https://issues.apache.org/jira/browse/KAFKA-4477
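
A quick way to confirm you are hitting it is to check whether the ISR for the
affected partitions has shrunk to just the stuck broker and stays that way.
A sketch, assuming you run it from the Kafka install directory and that
ZooKeeper is reachable at zk1:2181 (adjust for your cluster):

    bin/kafka-topics.sh --describe --zookeeper zk1:2181 --under-replicated-partitions

While the broker is in that state you should see a long list of partitions
whose ISR is smaller than the replica list, matching the "Shrinking ISR ...
to 1" lines quoted below.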

On Wed, Dec 28, 2016 at 3:43 PM, Alessandro De Maria <
alessandro.dema...@gmail.com> wrote:

> Hello,
>
> I would like to get some help/advice on some issues I am having with my
> Kafka cluster.
>
> I am running Kafka (kafka_2.11-0.10.1.0) on a 5-broker cluster (Ubuntu
> 16.04).
>
> configuration is here: http://pastebin.com/cPch8Kd7
>
> Today one of the 5 brokers (id: 1) appeared to disconnect from the others.
>
> Its log shows this around that time:
> [2016-12-28 16:18:30,575] INFO Partition [aki_reload5yl_5,11] on broker 1: Shrinking ISR for partition [aki_reload5yl_5,11] from 2,3,1 to 1 (kafka.cluster.Partition)
> [2016-12-28 16:18:30,579] INFO Partition [ale_reload5yl_1,0] on broker 1: Shrinking ISR for partition [ale_reload5yl_1,0] from 5,1,2 to 1 (kafka.cluster.Partition)
> [2016-12-28 16:18:30,580] INFO Partition [hl7_staging,17] on broker 1: Shrinking ISR for partition [hl7_staging,17] from 4,1,5 to 1 (kafka.cluster.Partition)
> [2016-12-28 16:18:30,581] INFO Partition [hes_reload_5,37] on broker 1: Shrinking ISR for partition [hes_reload_5,37] from 1,2,5 to 1 (kafka.cluster.Partition)
> [2016-12-28 16:18:30,582] INFO Partition [aki_live,38] on broker 1: Shrinking ISR for partition [aki_live,38] from 5,2,1 to 1 (kafka.cluster.Partition)
> [2016-12-28 16:18:30,582] INFO Partition [hl7_live,51] on broker 1: Shrinking ISR for partition [hl7_live,51] from 1,3,4 to 1 (kafka.cluster.Partition)
>
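You can also see what the cluster currently thinks the ISR is for one of
those partitions by reading its state znode straight out of ZooKeeper. A
sketch, again assuming zk1:2181 and using the first partition from the log
above:

    bin/zookeeper-shell.sh zk1:2181 get /brokers/topics/aki_reload5yl_5/partitions/11/state

With KAFKA-4477 the "isr" field in that znode typically stays at just the
stuck broker, [1] here, until that broker is restarted.
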
> (the other hosts logged the following:)
> java.io.IOException: Connection to 1 was disconnected before the response was read
>         at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
>         at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
>         at scala.Option.foreach(Option.scala:257)
>         at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
>         at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
>         at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
>         at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
>         at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
>         at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
>         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
>         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
>
>
> While this was happening, the ConsumerOffsetChecker was reporting only a few
> of the 128 partitions configured for some of the topics, and consumers
> started crashing.
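
For reference, something like the following is what I assume you ran; the
group name "my-group" is only a placeholder, zk1:2181 is an assumed ZooKeeper
address, and aki_live is one of the topics from your logs:

    bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zookeeper zk1:2181 --group my-group --topic aki_live

The checker fetches log sizes from each partition leader, so partitions whose
leader was the wedged broker 1 can simply fail to show up, which may be why
only a few of the 128 appeared.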
>
> I then used KafkaManager to reassign partitions from broker 1 to other
> brokers.
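
The command-line equivalent, in case KafkaManager is not available next time,
is kafka-reassign-partitions.sh driven by a JSON file. A minimal sketch,
assuming zk1:2181 and an example move of one aki_live partition onto brokers
2, 3 and 4; the file name and the actual assignment are up to you:

    move.json:
    {"version":1,"partitions":[{"topic":"aki_live","partition":86,"replicas":[2,3,4]}]}

    bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 --reassignment-json-file move.json --execute
    bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 --reassignment-json-file move.json --verify

Note that the new replicas still have to fetch from the current leader, so
reassigning partitions away from a broker that is wedged can stall until that
broker is restarted.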
>
> I could then see the following errors in the kafka1 log:
> [2016-12-28 17:23:51,816] ERROR [ReplicaFetcherThread-0-4], Error for partition [aki_live,86] to broker 4:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
> [2016-12-28 17:23:51,817] ERROR [ReplicaFetcherThread-0-4], Error for partition [aki_live,21] to broker 4:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
> [2016-12-28 17:23:51,817] ERROR [ReplicaFetcherThread-0-4], Error for partition [aki_live,126] to broker 4:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
> [2016-12-28 17:23:51,818] ERROR [ReplicaFetcherThread-0-4], Error for partition [aki_live,6] to broker 4:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
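
UnknownServerException on the fetching side just means the real exception
happened on the other end, broker 4 in this case, and was logged there.
Something like the following on broker 4 should show what actually failed;
the path assumes a default log location under the Kafka install directory
(here /opt/kafka), so adjust it and the timestamp for your setup:

    grep -B 2 -A 30 '2016-12-28 17:23:5' /opt/kafka/logs/server.log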
>
>
> I thought I would restart broker 1, but as soon as I did, most of my topics
> ended up with some empty partitions, and their consumer offsets were wiped
> out completely.
>
> I understand that because of unclean.leader.election.enable = true an
> unclean leader would be elected, but why were the partitions wiped out if
> there were at least 3 replicas for each?
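
On the unclean election point: when unclean.leader.election.enable is true, a
replica that is not in the ISR can be elected leader, and the surviving
replicas then truncate their logs to match the new leader. If that replica
was far behind (or came up nearly empty after the restart), the partition can
end up effectively empty even though three copies existed, and if your
offsets are stored in the __consumer_offsets topic rather than ZooKeeper the
same thing can happen to them. If you would rather have partitions go offline
than lose data, the safer setting in server.properties on every broker is:

    unclean.leader.election.enable=false

I believe it can also be applied as a topic-level override if you only want
it on the important topics.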
>
> What do you think caused the disconnection in the first place, and how can I
> recover from situations like this in the future?
>
> Regards
> Alessandro
>
>
>
>
>
> --
> Alessandro De Maria
> alessandro.dema...@gmail.com
>
