Sorry, I forgot to mention the Kafka version: it is 0.11.0.



Shen Guanghui
China UnionPay, Technology Division, QuickPass Team
Tel: 20633284 | 13696519872
China UnionPay Campus, 1699 Gutang Road, Pudong New District, Shanghai

 
From: shenguang...@unionpay.com
Sent: 2019-11-19 18:59
To: users
Subject: partition gets under-replicated and stuck, describe command shows the
leader is a dead broker id

Some Kafka partitions became under-replicated, shrinking to a single ISR, and
did not recover. I have 8 brokers (ids 0 to 7), and every topic has 3 replicas.
One day broker 0 hit a young GC pause of 3.29 seconds, after which some
partitions shrank their ISR from 3 to 1. The log shows:

[2019-11-08 13:35:00,821] INFO Partition [dcs_async_redis_to_db,7] on broker 0: Shrinking ISR from 0,1,2 to 0 (kafka.cluster.Partition)
[2019-11-08 13:35:00,824] INFO Partition [__consumer_offsets,15] on broker 0: Shrinking ISR from 0,1,2 to 0,1 (kafka.cluster.Partition)
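As I understand it (this is my reading of the documentation, not something
taken from our logs), ISR membership in this Kafka version is governed by a
single broker-side timeout, so a long enough pause can explain a shrink like
the one above. For reference:

```properties
# server.properties (broker side) -- the 0.11 default, shown for reference.
# A follower is removed from the ISR if it has not caught up with the
# leader's log end offset within this many milliseconds.
replica.lag.time.max.ms=10000
```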
There were many timeout exceptions on the producers during the GC pause. After
a while, the other 7 brokers consistently logged:

[2019-11-08 13:35:24,241] WARN [ReplicaFetcherThread-0-0]: Error in fetch to broker 0, request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={__consumer_offsets-7=(offset=44372693, logStartOffset=0, maxBytes=1048576), __consumer_offsets-15=(offset=78350976, logStartOffset=0, maxBytes=1048576), dcs_async_redis_to_db-7=(offset=758846267, logStartOffset=757998253, maxBytes=1048576)}) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 0 was disconnected before the response was read
        at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:93)
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:93)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:207)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:151)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)

This log is an excerpt from broker 1; the other brokers show the same thing.
What is stranger is that I tried to kill broker 0 but failed, and finally had
to kill it with -9. Even after broker 0 was killed, running --describe on
another broker still showed broker 0 as the leader of partition
[dcs_async_redis_to_db,7]. I am sure broker 0 was dead at that time. After I
restarted broker 0, the cluster returned to a correct state. There were some
other incidents during the process, but I believe they are unrelated to the
problem that confuses me.
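For completeness, the commands I used to check the partition state looked like
this (a sketch; the ZooKeeper address is a placeholder for our actual quorum):

```shell
# Describe the topic; in my case the Leader column still showed 0
# for partition 7 even though broker 0 was dead.
bin/kafka-topics.sh --zookeeper zk1:2181 --describe --topic dcs_async_redis_to_db

# List only the partitions whose ISR is smaller than the replica set.
bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions
```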

I searched the Kafka JIRA; the related issues seem to be:
https://issues.apache.org/jira/browse/KAFKA-6582
https://issues.apache.org/jira/browse/KAFKA-4477

KAFKA-4477 is marked as fixed, but I cannot find the related commit, code, or
patch. I would greatly appreciate your help. I have the Kafka logs from the
whole period if you need them.



