GC pauses (in milliseconds) for all brokers in the cluster for two
minutes around one recent episode:

  min  max  count
    7    8     51
    6    8     49
    6    9     53   <-- broker removed from isr
    6    9     55
    7    9     54
    7   10     52
    6   11     56
    7   11     54
    6   12     58
    7   14     52
   10   18     53   <-- leader of affected partitions
   12   24     56
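(In case it's useful to anyone checking their own brokers: per-pause durations
like the above can be collected in-process with the HotSpot GC notification
beans. A minimal sketch, assuming a HotSpot JVM and an illustrative one-minute
reporting interval; it observes the JVM it runs in, so it would have to be
loaded into the broker, e.g. as a java agent:)

import com.sun.management.GarbageCollectionNotificationInfo;
import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class GcPauseStats {
    public static void main(String[] args) throws InterruptedException {
        // Per-pause durations (ms) collected from HotSpot GC notifications.
        List<Long> pausesMs = new CopyOnWriteArrayList<>();

        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (!(gc instanceof NotificationEmitter)) continue;
            ((NotificationEmitter) gc).addNotificationListener((notification, handback) -> {
                if (GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                        .equals(notification.getType())) {
                    GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                        .from((CompositeData) notification.getUserData());
                    pausesMs.add(info.getGcInfo().getDuration());  // pause duration in ms
                }
            }, null, null);
        }

        // Report min/max/count once a minute (interval is illustrative).
        while (true) {
            Thread.sleep(60_000);
            if (pausesMs.isEmpty()) continue;
            long min = pausesMs.stream().mapToLong(Long::longValue).min().getAsLong();
            long max = pausesMs.stream().mapToLong(Long::longValue).max().getAsLong();
            System.out.printf("min %d  max %d  count %d%n", min, max, pausesMs.size());
            pausesMs.clear();
        }
    }
}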
So, all the pauses are brief (low tens of ms), and I wouldn't say the
affected brokers are outliers wrt the unaffected brokers.

On Fri, May 3, 2019 at 12:20 AM Yz Xyz <oracl...@gmail.com> wrote:
>
> Hi Steven - what are the pause times, GC and memory usage on the Kafka
> brokers? When pause times are high, the Kafka broker disconnects from
> ZK, drops out of the cluster, and rejoins later, which causes the ISR
> to shrink and expand.
>
> > On May 2, 2019, at 5:39 PM, Steven Taschuk <stasc...@stripe.com.invalid> wrote:
> >
> > Our kafka broker logs show episodes where several partitions have
> > "Shrinking ISR ..." messages followed, usually less than 2 seconds
> > later, by corresponding "Expanding ISR ..." messages that restore the
> > original set of brokers for all the partitions. Does anyone have any
> > suggestions on how to investigate this?
> >
> > On our main cluster, consisting of 12 brokers running kafka 2.0.1,
> > with 2500-3000 partitions in 300-400 topics (each partition having 3
> > replicas), we see about 10 such episodes a day, each involving
> > typically 5-8 partitions. The set of partitions varies but often
> > repeats or nearly repeats; to illustrate that, here are the partitions
> > affected in one broker's episodes over the past week:
> >   2019-04-27T17:51  status-4,dining-11,tax-0,education-0,government-7,locker-19,credit-11,law-15
> >   2019-04-27T18:01  family-14,health-13,golf-2,phone-14,news-15,peace-13
> >   2019-04-27T22:59  stock-13,income-2,insurance-0,college-0,district-1,breast-8,back-41
> >   2019-04-28T02:35  mg-12,executive-18,nursing-12
> >   2019-04-28T02:51  mg-12,executive-18,nursing-12
> >   2019-04-28T08:34  health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
> >   2019-04-28T12:15  health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
> >   2019-04-28T19:03  health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
> >   2019-04-28T20:16  health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
> >   2019-04-29T18:16  climate-11,lemon-12,faculty-22,side-4,music-12,police-0,room-0,press-8,parking-3,subject-10,blood-3
> >   2019-04-30T22:54  living-10,death-15,drug-3,talk-12,cell-12
> >   2019-05-01T16:52  oil-4,community-5,ice-7,public-2,substance-0,grocery-9,carbon-27,g-12
> >   2019-05-01T17:01  child-16,community-5,ice-7,public-2,grocery-9,carbon-27,task-3,g-12
> >   2019-05-01T17:36  community-5,ice-7,public-2,grocery-9,carbon-27,g-12
> >   2019-05-01T22:14  school-9,interest-2,kitchen-0,hotel-3,carbon-3,heart-14
> > (I've replaced the actual topic names with common words.)
> >
> > We see similar behaviour on another cluster, consisting of 7 brokers
> > running kafka 2.2.0.
> >
> > We haven't found anything unusual in the surrounding logs, or in
> > metrics about the network and disk activity of the brokers.
> >
> > Some similar-looking issues from jira:
> >   https://issues.apache.org/jira/browse/KAFKA-4003
> >     similar in that the expand happens within a second or two of the shrink
> >   https://issues.apache.org/jira/browse/KAFKA-4674
> >   https://issues.apache.org/jira/browse/KAFKA-3916
> >     both of these involve disconnections, which we don't see
> >   https://issues.apache.org/jira/browse/KAFKA-7152
> >     talks about a constant churn of shrink/expand, which we don't see
> >     (also, it was fixed in 2.1.0, and we see the problem on 2.2.0)
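(One more thing that may help with the investigation: polling the current ISR
against the full replica set from outside the brokers makes these episodes
visible without grepping logs. A minimal sketch using the Kafka AdminClient;
the bootstrap address and topic names below are placeholders, not the real
ones:)

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.Arrays;
import java.util.Map;
import java.util.Properties;

public class IsrSnapshot {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; point at the real cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Placeholder topic names (the masked names from the episode list).
            Map<String, TopicDescription> topics =
                admin.describeTopics(Arrays.asList("health", "living", "death")).all().get();

            for (TopicDescription topic : topics.values()) {
                for (TopicPartitionInfo p : topic.partitions()) {
                    // A partition whose ISR is smaller than its replica set is
                    // in the middle of one of these shrink/expand episodes.
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("%s-%d under-replicated: isr=%s replicas=%s leader=%s%n",
                            topic.name(), p.partition(), p.isr(), p.replicas(), p.leader());
                    }
                }
            }
        }
    }
}

(Run on a short interval, a few hundred ms, this should catch windows of the
1-2 second length described above and give an independent timestamp to
correlate with the broker logs.)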