I am sorry, I forgot to tell you the Kafka version: it is 0.11.0.
Shen Guanghui
China UnionPay, Technology Division, Cloud QuickPass Team
Tel: 20633284 | 13696519872
China UnionPay Park, 1699 Gutang Road, Pudong New District, Shanghai

From: shenguang...@unionpay.com
Sent: 2019-11-19 18:59
To: users
Subject: partition gets under-replicated and stuck, describe command shows the leader is a dead broker id

Kafka partitions get under-replicated, with a single ISR, and do not recover. I have 8 brokers (broker ids 0 to 7) and several topics, each with 3 replicas. One day broker 0 had a young GC pause of 3.29 seconds, and after that some partitions shrank their ISR from 3 to 1. The log is:

[2019-11-08 13:35:00,821] INFO Partition [dcs_async_redis_to_db,7] on broker 0: Shrinking ISR from 0,1,2 to 0 (kafka.cluster.Partition)
[2019-11-08 13:35:00,824] INFO Partition [__consumer_offsets,15] on broker 0: Shrinking ISR from 0,1,2 to 0,1 (kafka.cluster.Partition)

There were many timeout exceptions on the producers during the GC. After a while, the other 7 brokers consistently logged:

[2019-11-08 13:35:24,241] WARN [ReplicaFetcherThread-0-0]: Error in fetch to broker 0, request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={__consumer_offsets-7=(offset=44372693, logStartOffset=0, maxBytes=1048576), __consumer_offsets-15=(offset=78350976, logStartOffset=0, maxBytes=1048576), dcs_async_redis_to_db-7=(offset=758846267, logStartOffset=757998253, maxBytes=1048576)}) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 0 was disconnected before the response was read
    at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:93)
    at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:93)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:207)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:151)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)

This excerpt is from broker 1; the other brokers log the same.

What is stranger is that I tried to kill broker 0 gracefully but failed, and finally killed it with -9. Even after broker 0 was killed, partition [dcs_async_redis_to_db,7] still showed broker 0 as its leader when I checked the topic status on another broker with the --describe command. I am sure broker 0 had already been killed at that time.

Finally, after I restarted broker 0, the cluster returned to a correct state. There were some incidents during that process, but I think they are unrelated to the problem that confuses me.

I searched the Kafka issue tracker; the related issues I found are:
https://issues.apache.org/jira/browse/KAFKA-6582
https://issues.apache.org/jira/browse/KAFKA-4477
KAFKA-4477 is marked as fixed, but I cannot find the related commit, code, or patch. I would be grateful for your help. I have the Kafka logs for the whole period if you want them.
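P.S. In case it helps to reproduce what I saw, the check mentioned above was roughly the following (a sketch only; the ZooKeeper connect string zk1:2181 is a placeholder for our real one):

    bin/kafka-topics.sh --zookeeper zk1:2181 --describe --topic dcs_async_redis_to_db
    bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions

The output for the stuck partition looked roughly like this (illustrative, not a verbatim copy; the replica assignment 0,1,2 is taken from the broker log above):

    Topic: dcs_async_redis_to_db  Partition: 7  Leader: 0  Replicas: 0,1,2  Isr: 0

i.e. broker 0 was still reported as leader and the only ISR member, even though it had already been killed.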