Yes, Kafka exposes metrics for both the pending requests and the request handler idle percentage. Check the monitoring section of the Kafka documentation and you should find them; otherwise, let me know.
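
For example, if JMX is enabled on broker 6 (JMX_PORT exported; I am assuming port 9999 and the hostname broker6 here, so adjust both to your environment), the bundled JmxTool can print both values from the command line. A rough sketch:

  # Request handler idle ratio: close to 1.0 means mostly idle, close to 0 means the handler threads are saturated
  bin/kafka-run-class.sh kafka.tools.JmxTool \
    --jmx-url service:jmx:rmi:///jndi/rmi://broker6:9999/jmxrmi \
    --object-name 'kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent'

  # Number of requests sitting in the request queue (pending requests)
  bin/kafka-run-class.sh kafka.tools.JmxTool \
    --jmx-url service:jmx:rmi:///jndi/rmi://broker6:9999/jmxrmi \
    --object-name 'kafka.network:type=RequestChannel,name=RequestQueueSize'

The same MBeans can also be read by whatever monitoring agent you already attach to JMX.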

From the operations you did:

1. Restarting the Kafka service not helping is expected, since that broker was already the bottleneck.
2. Forcing a controller election: I would not do that, since there is no controller issue.
3. Restarting the server (id = 6) worked.

The fact that #3 worked gives me two clues:

1. Restarting the server clears away all of the pending requests and unfinished work on broker 6.
2. Restarting the server triggers a partition leader rebalance, so the other brokers take over some of the traffic.

Given #1 and #2, even though things are working for now, I suspect that server will run into the same bottleneck again after a few days. By the way, it is worth spending some time comparing broker 6 with the other servers to see whether there are any hardware or network configuration differences.
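
Also, if broker 6 does run into trouble again, you may not need a full machine restart to spread the load: checking which partitions are under-replicated and then triggering a preferred leader election can rebalance leadership across the brokers. A rough sketch with the stock tools, assuming ZooKeeper is reachable at zk1:2181 (zk1 is just a placeholder taken from your zookeeper.connect, and 2181 is the default client port):

  # List the partitions that are currently under-replicated (and their replica assignment)
  bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions

  # Move partition leadership back to the preferred replicas so leaders are spread evenly again
  bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181

This does not fix the underlying bottleneck on broker 6, it only redistributes the leader load.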

On Sun, Dec 9, 2018 at 1:22 AM Suman B N <sumannew...@gmail.com> wrote:

> Thanks Tony.
>
> How can I check the number of pending requests and the idle percentage of
> the request handlers on node 6? Are there any metrics for those?
>
> Controller logs confirm that the controller is not able to send
> updateMetadataRequest() to that particular node acting as the follower.
> Whereas to all other nodes, it is able to send the request and get the
> response. But with node 6, before it gets the response, the channel is
> closed. Hence we see the error logs mentioned in the earlier thread.
>
> Also, it looks like we hit this
> <https://issues.apache.org/jira/browse/KAFKA-5153> deadlock situation. It
> has been fixed in the next version.
>
> Solutions tried are:
>
> - Restart the kafka service on the follower which is not syncing with the
> leader (node 6 as per the example). Didn't help.
> - Invoke a controller election by removing the controller znode in
> zookeeper. Didn't help.
> - Restart the machine itself. Ironically, this worked for us! After the
> restart, the controller was able to send updateMetadataRequest to the node
> and the node started syncing with the leader. It took some time to be
> in-sync, but it worked.
>
> Thanks,
> Suman
>
> On Sun, Dec 9, 2018 at 11:53 AM Tony Liu <jiangtao....@zuora.com.invalid>
> wrote:
>
> > You mentioned:
> >
> > 1. Broker disconnection errors: normally the case I have seen is when
> > some broker is busy and cannot respond to connections from the other
> > replicas quickly enough.
> > 2. Under-replicated partitions: normally this points to some broker
> > having a performance issue.
> > 3. 90% of the under-replicated partitions have the same node, let's say
> > the broker id is 6.
> >
> > That gives me the idea that your broker with id 6 may have some
> > bottleneck, so can you also check the number of pending requests and the
> > idle percentage of the request handlers on node 6?
> >
> > thanks.
> >
> > ty
> >
> > On Sat, Dec 8, 2018 at 9:02 PM Suman B N <sumannew...@gmail.com> wrote:
> >
> > > Still hoping for some help here.
> > >
> > > On Fri, Dec 7, 2018 at 12:24 AM Suman B N <sumannew...@gmail.com>
> > > wrote:
> > >
> > > > Guys,
> > > > Another observation is that 90% of the under-replicated partitions
> > > > have the same node as the follower.
> > > >
> > > > *Any help in here is very much appreciated. We have very little time
> > > > to stabilize Kafka. Thanks a lot in advance.*
> > > >
> > > > -Suman
> > > >
> > > > On Thu, Dec 6, 2018 at 9:08 PM Suman B N <sumannew...@gmail.com>
> > > > wrote:
> > > >
> > > >> +users
> > > >>
> > > >> On Thu, Dec 6, 2018 at 9:01 PM Suman B N <sumannew...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Team,
> > > >>>
> > > >>> We are observing ISR shrink and expand very frequently. In the
> > > >>> logs of the follower, the errors below are observed:
> > > >>>
> > > >>> [2018-12-06 20:00:42,709] WARN [ReplicaFetcherThread-2-15], Error in
> > > >>> fetch kafka.server.ReplicaFetcherThread$FetchRequest@a0f9ba9
> > > >>> (kafka.server.ReplicaFetcherThread)
> > > >>> java.io.IOException: Connection to 15 was disconnected before the
> > > >>> response was read
> > > >>> at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3(NetworkClientBlockingOps.scala:114)
> > > >>> at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3$adapted(NetworkClientBlockingOps.scala:112)
> > > >>> at scala.Option.foreach(Option.scala:257)
> > > >>> at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$1(NetworkClientBlockingOps.scala:112)
> > > >>> at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:136)
> > > >>> at kafka.utils.NetworkClientBlockingOps$.pollContinuously$extension(NetworkClientBlockingOps.scala:142)
> > > >>> at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
> > > >>> at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:249)
> > > >>> at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
> > > >>> at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
> > > >>> at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
> > > >>> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
> > > >>> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
> > > >>>
> > > >>> Can someone explain this? And help us understand how we can resolve
> > > >>> these under-replicated partitions.
> > > >>>
> > > >>> server.properties file:
> > > >>> broker.id=15
> > > >>> port=9092
> > > >>> zookeeper.connect=zk1,zk2,zk3,zk4,zk5,zk6
> > > >>>
> > > >>> default.replication.factor=2
> > > >>> log.dirs=/data/kafka
> > > >>> delete.topic.enable=true
> > > >>> zookeeper.session.timeout.ms=10000
> > > >>> inter.broker.protocol.version=0.10.2
> > > >>> num.partitions=3
> > > >>> min.insync.replicas=1
> > > >>> log.retention.ms=259200000
> > > >>> message.max.bytes=20971520
> > > >>> replica.fetch.max.bytes=20971520
> > > >>> replica.fetch.response.max.bytes=20971520
> > > >>> max.partition.fetch.bytes=20971520
> > > >>> fetch.max.bytes=20971520
> > > >>> log.flush.interval.ms=5000
> > > >>> log.roll.hours=24
> > > >>> num.replica.fetchers=3
> > > >>> num.io.threads=8
> > > >>> num.network.threads=6
> > > >>> log.message.format.version=0.9.0.1
> > > >>>
> > > >>> Also, in what cases do we end up in this state? We have 1200-1400
> > > >>> topics and 5000-6000 partitions spread across a 20-node cluster, but
> > > >>> only 30-40 partitions are under-replicated while the rest are
> > > >>> in-sync. 95% of these partitions have a replication factor of 2.
> > > >>>
> > > >>> --
> > > >>> *Suman*
> > > >>
> > > >>
> > > >> --
> > > >> *Suman*
> > > >> *OlaCabs*
> > > >
> > > >
> > > > --
> > > > *Suman*
> > > > *OlaCabs*
> > >
> > >
> > > --
> > > *Suman*
> > > *OlaCabs*
>
>
> --
> *Suman*
> *OlaCabs*
>