Rishank Chandra Puram created KAFKA-10250:
---------------------------------------------

             Summary: Kafka broker shrinks the ISRs and disconnects from other 
brokers for a few seconds
                 Key: KAFKA-10250
                 URL: https://issues.apache.org/jira/browse/KAFKA-10250
             Project: Kafka
          Issue Type: Bug
          Components: build, config, consumer, controller, log, producer 
    Affects Versions: 2.0.0
            Reporter: Rishank Chandra Puram


The following is a summary/overview of the whole issue. Can you please help us 
look into the details below and let us know your thoughts on what caused the 
issue, and how to mitigate it in the future?

Issue:
1.      All of a sudden, all the other brokers in the cluster report issues 
with the affected broker, as shown below
            
Error for good broker [broker: broker2, brokerID: 1030]
[2020-06-27 16:14:53,367] WARN [ReplicaFetcher replicaId=1030, leaderId=1029, 
fetcherId=13] Error in response for fetch request (type=FetchRequest, 
replicaId=1030, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={}, 
isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1901394956, 
epoch=1018699)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 1029 was disconnected before the response 
was read
        at 
org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
        at 
kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
        at 
kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:240)
        at 
kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:43)
        at 
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:149)
        at 
kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:114)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
 
Error from affected broker [broker: broker6, brokerID: 1029]
[2020-06-27 16:14:25,744] INFO [Partition topic_test2-33 broker=1029] Shrinking 
ISR from 1025,1029,1030 to 1025,1029 (kafka.cluster.Partition)
[2020-06-27 16:14:25,752] INFO [Partition topic_a_restate-34 broker=1029] 
Shrinking ISR from 1027,1029,1028 to 1029 (kafka.cluster.Partition)
[2020-06-27 16:14:25,760] INFO [Partition topic_f_restate-39 broker=1029] 
Shrinking ISR from 1026,1029,1025 to 1029,1025 (kafka.cluster.Partition)
[2020-06-27 16:14:25,772] INFO [Partition topic_test2-16 broker=1029] Shrinking 
ISR from 1028,1029,1030 to 1029,1030 (kafka.cluster.Partition)
[2020-06-27 16:14:26,683] INFO [ProducerStateManager partition=topic_abc_f-21] 
Writing producer snapshot at offset 1509742173 (kafka.log.ProducerStateManager)
[2020-06-27 16:14:26,683] INFO [Log partition=topic_abc_f-21, 
dir=/hadoop-e/kafka/data1] Rolled new log segment at offset 1509742173 in 1 ms. 
(kafka.log.Log)
[2020-06-27 16:14:55,375] INFO [Partition topic_test2-33 broker=1029] Expanding 
ISR from 1025,1029 to 1025,1029,1030 (kafka.cluster.Partition)
[2020-06-27 16:14:55,387] INFO [Partition test-topic-analysis-9 broker=1029] 
Expanding ISR from 1028,1029 to 1028,1029,1030 (kafka.cluster.Partition)
 
2.      The entire Kafka cluster becomes stable within a few minutes
 
Trace for good broker [broker: broker2, brokerID: 1030]
[2020-06-27 16:20:14,861] INFO Deleted time index 
/hadoop-g/kafka/data1/topic-analysis-0/00000000009100172512.timeindex.deleted. 
(kafka.log.LogSegment)
[2020-06-27 16:20:14,882] INFO [Log partition=topic-analysis-4, 
dir=/hadoop-b/kafka/data1] Deleting segment 9100010843 (kafka.log.Log)
[2020-06-27 16:20:14,883] INFO Deleted log 
/hadoop-b/kafka/data1/topic-analysis-4/00000000009100010843.log.deleted. 
(kafka.log.LogSegment)
[2020-06-27 16:20:14,897] INFO Deleted offset index 
/hadoop-b/kafka/data1/topic-analysis-4/00000000009100010843.index.deleted. 
(kafka.log.LogSegment)
[2020-06-27 16:20:14,897] INFO Deleted time index 
/hadoop-b/kafka/data1/topic-analysis-4/00000000009100010843.timeindex.deleted. 
(kafka.log.LogSegment)
 
Trace from affected broker [broker: broker6, brokerID: 1029]
[2020-06-27 16:21:01,552] INFO [Log partition=topic-analysis-22, 
dir=/hadoop-i/kafka/data1] Scheduling log segment [baseOffset 9099830344, size 
1073733482] for deletion. (kafka.log.Log)
[2020-06-27 16:21:01,553] INFO [Log partition=topic-analysis-22, 
dir=/hadoop-i/kafka/data1] Scheduling log segment [baseOffset 9100082067, size 
1073738790] for deletion. (kafka.log.Log)
[2020-06-27 16:21:01,553] INFO [Log partition=topic-analysis-22, 
dir=/hadoop-i/kafka/data1] Incrementing log start offset to 9100334713 
(kafka.log.Log)
[2020-06-27 16:21:01,581] INFO Cleared earliest 0 entries from epoch cache 
based on passed offset 9100334713 leaving 1 in EpochFile for partition 
topic-analysis-22 (kafka.server.epoch.LeaderEpochFileCache)
[2020-06-27 16:22:00,175] INFO [Log partition=topic_def_c-1, 
dir=/hadoop-j/kafka/data1] Deleting segment 1628853412 (kafka.log.Log)
[2020-06-27 16:22:00,175] INFO Deleted log 
/hadoop-j/kafka/data1/topic_def_c-1/00000000001628853412.log.deleted. 
(kafka.log.LogSegment)
 
3.      Once the disconnect happens, all the producer jobs fail with 
"Caused by: org.apache.kafka.common.errors.TimeoutException: Batch 
containing 1 record(s) expired due to timeout while requesting metadata from 
brokers for topic-analysis-10" (a producer configuration sketch follows this list)
a.      If we describe the topics, all the topic partitions the jobs are 
complaining about have their leader on the affected broker [in this case, 
global_auth-10 has its leader on broker 1029]; see the AdminClient sketch after this list
4.      This continues until the affected broker is restarted. Once the 
broker is restarted, all the jobs run fine
5.      During the time between the broker becoming affected and our restarting 
it:
a.      We see a lot of ISR shrinks and expansions happening according to 
Grafana. These ISR expansions/shrinks are observed until the affected broker is 
restarted, but the shrinks/expands cannot be seen in the logs (see the JMX 
sketch after this list).
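
For reference, below is a minimal producer sketch (the bootstrap servers, topic, 
and timeout values are assumptions for illustration; the actual job configuration 
is not shown in this report) with the Kafka 2.0.0 client settings that govern the 
batch-expiry TimeoutException quoted in item 3:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerTimeoutSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder bootstrap servers; the real cluster endpoints are not in the report.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker2:9092,broker6:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // In 2.0.0, batches waiting in the accumulator expire after roughly
        // request.timeout.ms when metadata or the partition leader is unavailable,
        // producing the "Batch ... expired due to timeout" error quoted above.
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");
        // max.block.ms bounds how long send() itself blocks waiting for metadata.
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "60000");
        props.put(ProducerConfig.RETRIES_CONFIG, "3");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("topic-analysis-10", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // The TimeoutException surfaces here when the batch expires.
                            exception.printStackTrace();
                        }
                    });
            producer.flush();
        }
    }
}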
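
A rough sketch of the describe-topics check from item 3a, using the Java 
AdminClient (the bootstrap address and topic name are placeholders), listing 
partitions whose leader is on broker 1029 or whose ISR has shrunk:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class LeaderIsrCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker2:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics =
                    admin.describeTopics(Collections.singletonList("topic-analysis-10")).all().get();
            for (TopicDescription td : topics.values()) {
                for (TopicPartitionInfo p : td.partitions()) {
                    // Flag partitions led by the affected broker (1029) or with a shrunk ISR.
                    boolean ledByAffectedBroker = p.leader() != null && p.leader().id() == 1029;
                    boolean isrShrunk = p.isr().size() < p.replicas().size();
                    if (ledByAffectedBroker || isrShrunk) {
                        System.out.printf("%s-%d leader=%s isr=%s replicas=%s%n",
                                td.name(), p.partition(), p.leader(), p.isr(), p.replicas());
                    }
                }
            }
        }
    }
}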
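
And one possible way to read the ISR shrink/expand rates from item 5a directly 
from the broker's JMX endpoint rather than through Grafana (the JMX host and 
port are assumptions; the MBean names are the standard kafka.server 
ReplicaManager meters):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class IsrRatePoller {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX endpoint for the affected broker (broker6 / 1029).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker6:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ObjectName shrinks = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=IsrShrinksPerSec");
            ObjectName expands = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=IsrExpandsPerSec");
            // These meters expose Count and rate attributes such as OneMinuteRate.
            System.out.println("IsrShrinksPerSec count=" + conn.getAttribute(shrinks, "Count")
                    + " 1m-rate=" + conn.getAttribute(shrinks, "OneMinuteRate"));
            System.out.println("IsrExpandsPerSec count=" + conn.getAttribute(expands, "Count")
                    + " 1m-rate=" + conn.getAttribute(expands, "OneMinuteRate"));
        } finally {
            connector.close();
        }
    }
}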



