You mentioned:

   1. Broker disconnection errors; in the cases I have seen, this normally
   happens when some broker is busy and cannot respond to connections from
   other replicas quickly.
   2. Under-replicated partitions, which normally point to some broker
   having a performance issue.
   3. 90% of the under-replicated partitions have the same node, let's say
   the broker id is 6.

That gives me the idea that your broker with id 6 may have a bottleneck, so
can you also check the number of pending requests and the request handler
idle percentage on node 6?
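In case it helps, here is a minimal sketch of how those numbers can be read
over JMX from node 6. The host name and JMX port below are placeholders, and
it assumes the broker was started with JMX enabled (e.g. JMX_PORT set in its
environment); the MBeans queried are the standard broker metrics
RequestQueueSize, RequestHandlerAvgIdlePercent, and UnderReplicatedPartitions.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class Broker6HealthCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port for broker 6; adjust to your environment.
        String url = "service:jmx:rmi:///jndi/rmi://broker6-host:9999/jmxrmi";
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // Pending requests waiting for a request handler (network request queue depth).
            Object queueSize = mbsc.getAttribute(
                new ObjectName("kafka.network:type=RequestChannel,name=RequestQueueSize"),
                "Value");

            // Average idle fraction of the request handler (I/O) threads over the
            // last minute; values close to 0 mean the handler pool is saturated.
            Object handlerIdle = mbsc.getAttribute(
                new ObjectName("kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent"),
                "OneMinuteRate");

            // Under-replicated partition count as seen by this broker.
            Object underReplicated = mbsc.getAttribute(
                new ObjectName("kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"),
                "Value");

            System.out.println("RequestQueueSize             = " + queueSize);
            System.out.println("RequestHandlerAvgIdlePercent = " + handlerIdle);
            System.out.println("UnderReplicatedPartitions    = " + underReplicated);
        }
    }
}

If node 6's handler idle percentage stays near zero and its request queue
keeps growing while the other brokers look healthy, that would support the
bottleneck theory.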

thanks.

ty

On Sat, Dec 8, 2018 at 9:02 PM Suman B N <sumannew...@gmail.com> wrote:

> Still hoping for some help here.
>
> On Fri, Dec 7, 2018 at 12:24 AM Suman B N <sumannew...@gmail.com> wrote:
>
> > Guys,
> > Another observation is that 90% of the under-replicated partitions have
> > the same node as the follower.
> >
> > *Any help here is very much appreciated. We have very little time to
> > stabilize Kafka. Thanks a lot in advance.*
> >
> > -Suman
> >
> > On Thu, Dec 6, 2018 at 9:08 PM Suman B N <sumannew...@gmail.com> wrote:
> >
> >> +users
> >>
> >> On Thu, Dec 6, 2018 at 9:01 PM Suman B N <sumannew...@gmail.com> wrote:
> >>
> >>> Team,
> >>>
> >>> We are observing ISR shrink and expand very frequently. In the
> >>> follower's logs, the errors below are observed:
> >>>
> >>> [2018-12-06 20:00:42,709] WARN [ReplicaFetcherThread-2-15], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@a0f9ba9 (kafka.server.ReplicaFetcherThread)
> >>> java.io.IOException: Connection to 15 was disconnected before the response was read
> >>>         at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3(NetworkClientBlockingOps.scala:114)
> >>>         at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3$adapted(NetworkClientBlockingOps.scala:112)
> >>>         at scala.Option.foreach(Option.scala:257)
> >>>         at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$1(NetworkClientBlockingOps.scala:112)
> >>>         at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:136)
> >>>         at kafka.utils.NetworkClientBlockingOps$.pollContinuously$extension(NetworkClientBlockingOps.scala:142)
> >>>         at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
> >>>         at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:249)
> >>>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
> >>>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
> >>>         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
> >>>         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
> >>>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
> >>>
> >>> Can someone explain this, and help us understand how we can resolve
> >>> these under-replicated partitions?
> >>>
> >>> server.properties file:
> >>> broker.id=15
> >>> port=9092
> >>> zookeeper.connect=zk1,zk2,zk3,zk4,zk5,zk6
> >>>
> >>> default.replication.factor=2
> >>> log.dirs=/data/kafka
> >>> delete.topic.enable=true
> >>> zookeeper.session.timeout.ms=10000
> >>> inter.broker.protocol.version=0.10.2
> >>> num.partitions=3
> >>> min.insync.replicas=1
> >>> log.retention.ms=259200000
> >>> message.max.bytes=20971520
> >>> replica.fetch.max.bytes=20971520
> >>> replica.fetch.response.max.bytes=20971520
> >>> max.partition.fetch.bytes=20971520
> >>> fetch.max.bytes=20971520
> >>> log.flush.interval.ms=5000
> >>> log.roll.hours=24
> >>> num.replica.fetchers=3
> >>> num.io.threads=8
> >>> num.network.threads=6
> >>> log.message.format.version=0.9.0.1
> >>>
> >>> Also, in what cases do we end up in this state? We have 1200-1400
> >>> topics and 5000-6000 partitions spread across a 20-node cluster, but
> >>> only 30-40 partitions are under-replicated while the rest are in sync.
> >>> 95% of these partitions have a replication factor of 2.
> >>>
> >>> --
> >>> *Suman*
> >>>
> >>
> >>
> >> --
> >> *Suman*
> >> *OlaCabs*
> >>
> >
> >
> > --
> > *Suman*
> > *OlaCabs*
> >
>
>
> --
> *Suman*
> *OlaCabs*
>
