Hi, I just found a reported issue, https://issues.apache.org/jira/browse/KAFKA-4477; hopefully it's useful for you.
On Thu, Dec 29, 2016 at 12:08 PM, Tony Liu <jiangtao....@zuora.com> wrote:

Hi Thomas (or anyone),

I also ran into the same issue you reported. The only workaround I have found is to restart the broken node; I have not yet tracked down the root cause, so I am wondering whether you have made any progress on it.

For example, at first I thought the issue was caused by `ulimit`, but after I raised the open-file limit to 100k the random error still occurred after several days, so I now think it is related to Kafka itself.

Thanks.
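On the ulimit angle, one quick way to compare the broker's descriptor limit with what it is actually using is a small script along these lines. This is only a sketch, not something from the thread: it assumes a Linux /proc filesystem, and the PID argument is whatever the broker process id happens to be.

import os
import sys

def broker_fd_usage(pid):
    """Return (open_fds, soft_limit, hard_limit) for a process.
    Linux only: reads /proc/<pid>/fd and /proc/<pid>/limits."""
    open_fds = len(os.listdir("/proc/%d/fd" % pid))
    soft = hard = None
    with open("/proc/%d/limits" % pid) as limits:
        for line in limits:
            if line.startswith("Max open files"):
                fields = line.split()
                # keep as strings so "unlimited" is handled too
                soft, hard = fields[3], fields[4]
    return open_fds, soft, hard

if __name__ == "__main__":
    pid = int(sys.argv[1])  # PID of the Kafka broker process
    fds, soft, hard = broker_fd_usage(pid)
    print("open fds: %s, soft limit: %s, hard limit: %s" % (fds, soft, hard))

Run a few times in a row against the broker PID, this would also make the linearly growing descriptor count that Tom describes further down the thread visible directly.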
On Tue, Dec 6, 2016 at 12:59 PM, Thomas DeVoe <tde...@dataminr.com> wrote:

Hi All,

This happened again on our Kafka cluster: a single broker seems to "forget" the existence of the rest of the cluster and shrinks all of its ISRs down to just that node. The other two nodes get stuck in a loop trying to connect to this rogue node and never even register that it is no longer part of the cluster. Strangely, the network connection between all of these nodes is fine at the time, and restarting the node resolves it (though with some data loss due to unclean leader elections).

Anyone have any ideas? Help would be greatly appreciated.

Thanks,

Tom DeVoe
Software Engineer, Data, Dataminr <http://dataminr.com/>

On Tue, Nov 29, 2016 at 1:29 PM, Thomas DeVoe <tde...@dataminr.com> wrote:

Hi,

I encountered a strange issue in our Kafka cluster, where a single broker randomly entered a state in which it seemed to think it was the only broker in the cluster (it shrank all of its ISRs to contain only itself). Some details about the cluster:

- running in an EC2 VPC on AWS
- 3 nodes (d2.xlarge)
- Kafka version: 0.10.1.0

More information about the incident:

Around 19:57 yesterday, one of the nodes somehow lost its connection to the cluster and started reporting messages like this for what seemed to be all of its hosted topic partitions:

[2016-11-28 19:57:05,426] INFO Partition [arches_stage,0] on broker 1002: Shrinking ISR for partition [arches_stage,0] from 1003,1002,1001 to 1002 (kafka.cluster.Partition)
[2016-11-28 19:57:05,466] INFO Partition [connect-offsets,13] on broker 1002: Shrinking ISR for partition [connect-offsets,13] from 1003,1002,1001 to 1002 (kafka.cluster.Partition)
[2016-11-28 19:57:05,489] INFO Partition [lasagna_prod_memstore,2] on broker 1002: Shrinking ISR for partition [lasagna_prod_memstore,2] from 1003,1002,1001 to 1002 (kafka.cluster.Partition)
...

It then added the ISRs from the other machines back in:

[2016-11-28 19:57:18,013] INFO Partition [arches_stage,0] on broker 1002: Expanding ISR for partition [arches_stage,0] from 1002 to 1002,1003 (kafka.cluster.Partition)
[2016-11-28 19:57:18,015] INFO Partition [connect-offsets,13] on broker 1002: Expanding ISR for partition [connect-offsets,13] from 1002 to 1002,1003 (kafka.cluster.Partition)
[2016-11-28 19:57:18,018] INFO Partition [lasagna_prod_memstore,2] on broker 1002: Expanding ISR for partition [lasagna_prod_memstore,2] from 1002 to 1002,1003 (kafka.cluster.Partition)
...
[2016-11-28 19:57:18,222] INFO Partition [arches_stage,0] on broker 1002: Expanding ISR for partition [arches_stage,0] from 1002,1003 to 1002,1003,1001 (kafka.cluster.Partition)
[2016-11-28 19:57:18,224] INFO Partition [connect-offsets,13] on broker 1002: Expanding ISR for partition [connect-offsets,13] from 1002,1003 to 1002,1003,1001 (kafka.cluster.Partition)
[2016-11-28 19:57:18,227] INFO Partition [lasagna_prod_memstore,2] on broker 1002: Expanding ISR for partition [lasagna_prod_memstore,2] from 1002,1003 to 1002,1003,1001 (kafka.cluster.Partition)

and eventually removed them again before going on its merry way:

[2016-11-28 19:58:05,408] INFO Partition [arches_stage,0] on broker 1002: Shrinking ISR for partition [arches_stage,0] from 1002,1003,1001 to 1002 (kafka.cluster.Partition)
[2016-11-28 19:58:05,415] INFO Partition [connect-offsets,13] on broker 1002: Shrinking ISR for partition [connect-offsets,13] from 1002,1003,1001 to 1002 (kafka.cluster.Partition)
[2016-11-28 19:58:05,416] INFO Partition [lasagna_prod_memstore,2] on broker 1002: Shrinking ISR for partition [lasagna_prod_memstore,2] from 1002,1003,1001 to 1002 (kafka.cluster.Partition)

Node 1002 continued running normally from that point on (aside from the fact that all of its partitions were under-replicated). There were also no WARN or ERROR messages before or after this.
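For what it's worth, the shrink pattern itself is the leader's normal ISR maintenance: in 0.10.x the leader drops a follower from the ISR once the follower has not caught up within replica.lag.time.max.ms (10 seconds by default, as far as I know), so if the other brokers' fetch requests stop being processed for any reason, every follower looks lagged and the ISR collapses to the leader alone. A simplified model of that rule, purely for illustration and not Kafka's actual code:

import time

# Assumed 0.10.x broker default for replica.lag.time.max.ms;
# check server.properties for the real value in use.
REPLICA_LAG_TIME_MAX_MS = 10000

def now_ms():
    return int(time.time() * 1000)

def shrink_isr(isr, leader_id, last_caught_up_ms):
    """Simplified model of the leader's periodic ISR check: keep the leader,
    plus any follower whose last caught-up time is within the lag window."""
    kept = [leader_id]
    for replica_id in isr:
        if replica_id == leader_id:
            continue
        if now_ms() - last_caught_up_ms[replica_id] <= REPLICA_LAG_TIME_MAX_MS:
            kept.append(replica_id)
    return kept

# e.g. if brokers 1001 and 1003 have not fetched for ~30s, broker 1002 keeps only itself:
print(shrink_isr([1003, 1002, 1001], 1002,
                 {1001: now_ms() - 30000, 1003: now_ms() - 30000}))  # -> [1002]

That would also be consistent with the ISRs briefly expanding back at 19:57:18 (a few fetches getting through) and then collapsing again a minute later, all while plain TCP connectivity between the nodes stays fine.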
The other two nodes were not so happy, however, with both failing to connect to the node in question via the ReplicaFetcherThread. They reported this around the same time as the shrinking above:

[2016-11-28 19:57:16,087] WARN [ReplicaFetcherThread-0-1002], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@6eb44718 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 1002 was disconnected before the response was read
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
        at scala.Option.foreach(Option.scala:257)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
        at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
        at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
        at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
        at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)

and then got stuck retrying this every 30 seconds until I restarted node 1002:

[2016-11-28 20:02:04,513] WARN [ReplicaFetcherThread-0-1002], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@1cd61a02 (kafka.server.ReplicaFetcherThread)
java.net.SocketTimeoutException: Failed to connect within 30000 ms
        at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:249)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)

I restarted the node when I noticed this; however, because the replicas were out of sync, we ended up having an unclean leader election and ultimately lost data for the partitions on that machine. Some potentially interesting things about the cluster state at the time:

- I *was* able to telnet to port 9092 on the rogue machine from each of the other two Kafka brokers, even while they were reporting the connect failures (a connectivity probe along those lines is sketched after this list).
- The number of open file descriptors on that machine increased linearly for the entire 1.5 hours the cluster was in this state, eventually reaching ~4x the usual count. It went back to normal after the restart.
- The heap size on the node in question started fluctuating very rapidly. The usual behavior is that the heap slowly grows over a period of ~10 hours, then (I assume) a large GC occurs and the cycle starts again. On the affected node the period of that cycle dropped to ~5 minutes.
- The heap size spiked far higher than normal.
- While the node was in this state, the System/Process CPU dropped to ~1/8th of its usual level.
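On the telnet point: a plain TCP connect only shows that the OS is still accepting connections on 9092; it says nothing about whether the broker's network and request threads are actually servicing anything, which is presumably how the fetchers could still time out. A minimal probe of that kind, with a hypothetical host name and a 5-second timeout, would be:

import socket

def tcp_connect_ok(host, port, timeout_s=5.0):
    """True if a plain TCP connection can be established. This only proves the
    listener accepts connections, not that the broker is servicing requests."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout_s)
        sock.close()
        return True
    except (socket.timeout, OSError):
        return False

# Hypothetical broker host name; substitute the real one.
print(tcp_connect_ok("kafka-1002.example.internal", 9092))

Running it from each of the other two brokers while the fetchers are timing out would at least separate a network problem from a broker that has stopped processing requests.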
I have the full logs and more metrics collected for all 3 nodes over that time period and would be happy to share them, but I wasn't sure whether the user list supports attachments.

Any help would be greatly appreciated.

Thanks,

Tom DeVoe
Software Engineer, Data, Dataminr <http://dataminr.com/>