Thanks Tony.

How can I check the number of pending requests and the idle percentage
of the request handler on node 6? Are there any metrics for those?
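
If there are JMX MBeans for these (I'm assuming the MBean names below from
memory; please correct me if they are wrong), this is roughly what I plan to
run against node 6. The host/port and class name are placeholders:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class Node6LoadCheck {
    public static void main(String[] args) throws Exception {
        // Assumes JMX is enabled on the broker (e.g. JMX_PORT=9999); host/port are placeholders.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://node6:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // Requests queued up waiting for a request handler (I/O) thread.
            Object pending = mbsc.getAttribute(
                    new ObjectName("kafka.network:type=RequestChannel,name=RequestQueueSize"),
                    "Value");

            // Average fraction of time the request handler threads are idle.
            Object idle = mbsc.getAttribute(
                    new ObjectName("kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent"),
                    "OneMinuteRate");

            System.out.println("RequestQueueSize = " + pending);
            System.out.println("RequestHandlerAvgIdlePercent (1m rate) = " + idle);
        }
    }
}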

Controller logs confirm that the controller is not able to send
UpdateMetadataRequest to that particular node acting as the follower,
whereas to all other nodes it is able to send the request and get a
response. With node 6, the channel is closed before the response is
received, hence the error logs mentioned in the earlier thread.

It also looks like we hit the deadlock described in KAFKA-5153
<https://issues.apache.org/jira/browse/KAFKA-5153>, which has been fixed in
the next version.
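
For what it's worth, here is a rough sketch (not something we have run yet)
of how one could check a broker's JVM over JMX for deadlocked threads, to
confirm whether we are really in the situation described in that JIRA. The
host/port and class name are placeholders, and JMX must be enabled on the
broker:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DeadlockCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; assumes JMX is enabled on the broker.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://node6:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

            long[] deadlocked = threads.findDeadlockedThreads();
            if (deadlocked == null) {
                System.out.println("No deadlocked threads found.");
            } else {
                // Print the stack of each deadlocked thread to compare with the JIRA.
                for (ThreadInfo info : threads.getThreadInfo(deadlocked, Integer.MAX_VALUE)) {
                    System.out.println(info);
                }
            }
        }
    }
}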

Solutions tried are:

   - Restart the Kafka service on the follower that is not syncing with the
   leader (node 6 in this example). Didn't help.
   - Invoke controller election by removing the controller znode in
   ZooKeeper (a minimal sketch of this follows below). Didn't help.
   - Restart the machine itself. Ironically, this worked for us! After the
   restart, the controller was able to send UpdateMetadataRequest to the
   node, and the node started syncing with the leader. It took some time to
   get in-sync, but it worked.
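
For reference, a minimal sketch of the controller re-election step above,
assuming direct access to the ZooKeeper ensemble and the default /controller
path (the connect string and class name are placeholders):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class ForceControllerElection {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string and session timeout; adjust for your ensemble.
        ZooKeeper zk = new ZooKeeper("zk1:2181", 10000, event -> { });
        try {
            // Deleting the ephemeral /controller znode makes the brokers elect a new controller.
            zk.delete("/controller", -1);
            System.out.println("Deleted /controller; a new controller election should follow.");
        } catch (KeeperException.NoNodeException e) {
            System.out.println("/controller does not exist; an election may already be in progress.");
        } finally {
            zk.close();
        }
    }
}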

Thanks,
Suman

On Sun, Dec 9, 2018 at 11:53 AM Tony Liu <jiangtao....@zuora.com.invalid>
wrote:

> you mentioned:
>
>    1. broker disconnection errors: normally the case I have seen is when
>    some broker is busy and cannot respond to connections from other
>    replicas quickly.
>    2. under-replicated partitions: normally points to some broker that
>    may have a performance issue.
>    3. 90% of under-replicated partitions have the same node, let's say
>    the broker id is 6.
>
> that gives me the impression that your broker with id 6 may have some
> bottleneck, so can you also check the number of pending requests and the
> idle percentage of the request handler on node 6?
>
> thanks.
>
> ty
>
> On Sat, Dec 8, 2018 at 9:02 PM Suman B N <sumannew...@gmail.com> wrote:
>
> > Still hoping for some help here.
> >
> > On Fri, Dec 7, 2018 at 12:24 AM Suman B N <sumannew...@gmail.com> wrote:
> >
> > > Guys,
> > > Another observation: 90% of the under-replicated partitions have the
> > > same node as the follower.
> > >
> > > *Any help here is very much appreciated. We have very little time to
> > > stabilize Kafka. Thanks a lot in advance.*
> > >
> > > -Suman
> > >
> > > On Thu, Dec 6, 2018 at 9:08 PM Suman B N <sumannew...@gmail.com>
> wrote:
> > >
> > >> +users
> > >>
> > >> On Thu, Dec 6, 2018 at 9:01 PM Suman B N <sumannew...@gmail.com>
> wrote:
> > >>
> > >>> Team,
> > >>>
> > >>> We are observing ISR shrink and expand very frequently. In the logs
> > >>> of the follower, the errors below are observed:
> > >>>
> > >>> [2018-12-06 20:00:42,709] WARN [ReplicaFetcherThread-2-15], Error in
> > >>> fetch kafka.server.ReplicaFetcherThread$FetchRequest@a0f9ba9
> > >>> (kafka.server.ReplicaFetcherThread)
> > >>> java.io.IOException: Connection to 15 was disconnected before the
> > >>> response was read
> > >>>         at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3(NetworkClientBlockingOps.scala:114)
> > >>>         at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3$adapted(NetworkClientBlockingOps.scala:112)
> > >>>         at scala.Option.foreach(Option.scala:257)
> > >>>         at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$1(NetworkClientBlockingOps.scala:112)
> > >>>         at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:136)
> > >>>         at kafka.utils.NetworkClientBlockingOps$.pollContinuously$extension(NetworkClientBlockingOps.scala:142)
> > >>>         at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
> > >>>         at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:249)
> > >>>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
> > >>>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
> > >>>         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
> > >>>         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
> > >>>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
> > >>>
> > >>> Can someone explain this? And help us understand how we can resolve
> > >>> these under-replicated partitions.
> > >>>
> > >>> server.properties file:
> > >>> broker.id=15
> > >>> port=9092
> > >>> zookeeper.connect=zk1,zk2,zk3,zk4,zk5,zk6
> > >>>
> > >>> default.replication.factor=2
> > >>> log.dirs=/data/kafka
> > >>> delete.topic.enable=true
> > >>> zookeeper.session.timeout.ms=10000
> > >>> inter.broker.protocol.version=0.10.2
> > >>> num.partitions=3
> > >>> min.insync.replicas=1
> > >>> log.retention.ms=259200000
> > >>> message.max.bytes=20971520
> > >>> replica.fetch.max.bytes=20971520
> > >>> replica.fetch.response.max.bytes=20971520
> > >>> max.partition.fetch.bytes=20971520
> > >>> fetch.max.bytes=20971520
> > >>> log.flush.interval.ms=5000
> > >>> log.roll.hours=24
> > >>> num.replica.fetchers=3
> > >>> num.io.threads=8
> > >>> num.network.threads=6
> > >>> log.message.format.version=0.9.0.1
> > >>>
> > >>> Also, in what cases do we end up in this state? We have 1200-1400
> > >>> topics and 5000-6000 partitions spread across a 20-node cluster, but
> > >>> only 30-40 partitions are under-replicated while the rest are
> > >>> in-sync. 95% of these partitions have a replication factor of 2.
> > >>>
> > >>> --
> > >>> *Suman*
> > >>>
> > >>
> > >>
> > >> --
> > >> *Suman*
> > >> *OlaCabs*
> > >>
> > >
> > >
> > > --
> > > *Suman*
> > > *OlaCabs*
> > >
> >
> >
> > --
> > *Suman*
> > *OlaCabs*
> >
>


-- 
*Suman*
*OlaCabs*
