Our Kafka broker logs show episodes where several partitions have "Shrinking ISR ..." messages followed, usually less than 2 seconds later, by corresponding "Expanding ISR ..." messages that restore the original set of brokers for all the partitions. Does anyone have any suggestions on how to investigate this?
On our main cluster, consisting of 12 brokers running Kafka 2.0.1, with 2500-3000 partitions in 300-400 topics (each partition having 3 replicas), we see about 10 such episodes a day, each typically involving 5-8 partitions. The set of partitions varies but often repeats or nearly repeats. To illustrate, here are the partitions affected in one broker's episodes over the past week:

  2019-04-27T17:51 status-4,dining-11,tax-0,education-0,government-7,locker-19,credit-11,law-15
  2019-04-27T18:01 family-14,health-13,golf-2,phone-14,news-15,peace-13
  2019-04-27T22:59 stock-13,income-2,insurance-0,college-0,district-1,breast-8,back-41
  2019-04-28T02:35 mg-12,executive-18,nursing-12
  2019-04-28T02:51 mg-12,executive-18,nursing-12
  2019-04-28T08:34 health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
  2019-04-28T12:15 health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
  2019-04-28T19:03 health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
  2019-04-28T20:16 health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
  2019-04-29T18:16 climate-11,lemon-12,faculty-22,side-4,music-12,police-0,room-0,press-8,parking-3,subject-10,blood-3
  2019-04-30T22:54 living-10,death-15,drug-3,talk-12,cell-12
  2019-05-01T16:52 oil-4,community-5,ice-7,public-2,substance-0,grocery-9,carbon-27,g-12
  2019-05-01T17:01 child-16,community-5,ice-7,public-2,grocery-9,carbon-27,task-3,g-12
  2019-05-01T17:36 community-5,ice-7,public-2,grocery-9,carbon-27,g-12
  2019-05-01T22:14 school-9,interest-2,kitchen-0,hotel-3,carbon-3,heart-14

(I've replaced the actual topic names with common words.)

We see similar behaviour on another cluster, consisting of 7 brokers running Kafka 2.2.0. We haven't found anything unusual in the surrounding logs, or in metrics about the network and disk activity of the brokers.
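For anyone who wants to build the same kind of per-episode summary from their own broker logs, here is a minimal sketch in Python. It is not necessarily how the list above was produced. It assumes the default log4j layout and a 2.x-style message such as "[Partition status-4 broker=3] Shrinking ISR from 3,5,7 to 3,5"; the regular expression, the function names, and the 30-second grouping gap are all illustrative assumptions and may need adjusting for your log format.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape (an assumption, not copied from any real log):
# [2019-04-28 02:35:10,001] INFO [Partition mg-12 broker=3] Shrinking ISR from 3,5,7 to 3,5 (kafka.cluster.Partition)
ISR_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\] INFO "
    r"\[Partition (?P<tp>\S+) broker=\d+\] "
    r"(?P<kind>Shrinking|Expanding) ISR from (?P<old>[\d,]+) to (?P<new>[\d,]+)"
)


def parse_isr_events(lines):
    """Return (timestamp, kind, topic_partition) for each ISR shrink/expand line."""
    events = []
    for line in lines:
        m = ISR_LINE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            events.append((ts, m.group("kind"), m.group("tp")))
    return events


def group_episodes(events, gap=timedelta(seconds=30)):
    """Group events into episodes: a new episode starts when more than
    `gap` has passed since the previous ISR event (hypothetical threshold)."""
    episodes = []
    last_ts = None
    # Sort on the timestamp only, so same-second events keep their log order.
    for ts, kind, tp in sorted(events, key=lambda e: e[0]):
        if last_ts is None or ts - last_ts > gap:
            episodes.append((ts, []))
        # Only shrinks define which partitions an episode affected.
        if kind == "Shrinking" and tp not in episodes[-1][1]:
            episodes[-1][1].append(tp)
        last_ts = ts
    # Drop groups that contain expands but no shrink.
    return [(start, parts) for start, parts in episodes if parts]


def format_episode(start, partitions):
    """Render one episode as 'YYYY-MM-DDTHH:MM part-a,part-b,...'."""
    return f"{start:%Y-%m-%dT%H:%M} {','.join(partitions)}"
```

To use it, pass the lines of a broker's server log to parse_isr_events, feed the result to group_episodes, and print format_episode(start, partitions) for each episode. Comparing the output across brokers may show whether the same follower or the same leader is involved each time.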
Some similar-looking issues from JIRA:

https://issues.apache.org/jira/browse/KAFKA-4003
  similar in that the expand happens within a second or two of the shrink

https://issues.apache.org/jira/browse/KAFKA-4674
https://issues.apache.org/jira/browse/KAFKA-3916
  both of these involve disconnections, which we don't see

https://issues.apache.org/jira/browse/KAFKA-7152
  talks about a constant churn of shrink/expand, which we don't see (also, was fixed in 2.1.0, and we see it on 2.2.0)