GeoffreyStark created KAFKA-12665:
-------------------------------------

             Summary: One of the brokers, which is also the controller, has too many CLOSE_WAIT connections
                 Key: KAFKA-12665
                 URL: https://issues.apache.org/jira/browse/KAFKA-12665
             Project: Kafka
          Issue Type: Bug
          Components: clients, consumer, controller, core
    Affects Versions: 0.11.0.1
            Reporter: GeoffreyStark
         Attachments: image-2021-04-14-10-32-54-140.png, image-2021-04-14-10-39-02-996.png, image-2021-04-14-11-26-03-346.png

# *Environment*

Apache Kafka 0.11.0.1
5 nodes, 3 replicas
Mean message rate: 4k messages per second
Monitoring: Prometheus, jmxProt and Grafana
Consumers: Spring Boot and Doris routineLoad
Producers: Spring Boot and Log

# *Problem encountered*

One broker (id 4), which is also the controller (epoch 90), accumulated a large number of connections in the CLOSE_WAIT state at the same time.

controller.log:
{code:java}
Controller 4 epoch 90 fails to send request (type: UpdateMetadataRequest ...
java.io.IOException: Connection to 4 was disconnected before the response was read
{code}
!image-2021-04-14-10-32-54-140.png!

The request is retried many times, but the warning stays the same.

At the same time, another broker (id 6) that was fetching messages from broker 4 ran into the same problem:
{code:java}
[2021-04-13 16:35:06,942] WARN [ReplicaFetcherThread-0-4]: Error in fetch to broker 4, request (type=FetchRequest, replicaId=6, maxWait=500, minBytes=1, maxBytes=10485760,
java.io.IOException: Connection to 4 was disconnected before the response was read
{code}
!image-2021-04-14-10-39-02-996.png!

Doris routineLoad (which consumes from Kafka) timed out:
{code:java}
2021-04-13 16:35:11,397 WARN (Routine load scheduler|42) [KafkaUtil.getAllKafkaPartitions():91] failed to get partitions.
org.apache.doris.common.UserException: errCode = 2, detailMessage = failed to get kafka partition info: [failed to get partition meta: Local: Timed out]
{code}

Broker 4 (controller, epoch 90) fs.file:

!image-2021-04-14-11-26-03-346.png!

Most of the CLOSE_WAIT connections were generated by the consumer application.

At 16:49 the broker was restarted and returned to normal.

# *Speculation*

The TCP connections are being closed passively (the remote peer closes first), but the controller broker process is not responding and does not close its own side, so the sockets stay in CLOSE_WAIT.

Are there any known bugs in this version that could cause this?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
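
To check the observation that most CLOSE_WAIT connections belong to the consumer application, the sockets on the broker host can be grouped by remote address. The following is only a minimal sketch, not something that was run for this report. It assumes Linux {{netstat -tan}} style output (columns: Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State) and the default Kafka listener port 9092. The function name {{count_close_wait}} and all addresses in {{SAMPLE}} are hypothetical and for illustration only.

```python
from collections import Counter


def count_close_wait(netstat_output: str, broker_port: int = 9092) -> Counter:
    """Count CLOSE_WAIT sockets on the broker port, grouped by remote host.

    Assumes `netstat -tan` style lines:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the header line and anything that is not a TCP socket row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "CLOSE_WAIT":
            continue
        # Only look at sockets accepted on the broker listener port (assumed 9092).
        if not local.endswith(f":{broker_port}"):
            continue
        # Drop the ephemeral port so connections are grouped per remote host.
        remote_host = remote.rsplit(":", 1)[0]
        counts[remote_host] += 1
    return counts


# Hypothetical sample data, NOT taken from the affected cluster.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.4:9092           10.0.0.21:51234         CLOSE_WAIT
tcp        1      0 10.0.0.4:9092           10.0.0.21:51240         CLOSE_WAIT
tcp        0      0 10.0.0.4:9092           10.0.0.6:40112          ESTABLISHED
tcp        0      0 10.0.0.4:9092           10.0.0.30:33810         CLOSE_WAIT
"""

if __name__ == "__main__":
    # Print remote hosts ordered by how many CLOSE_WAIT sockets they left behind.
    for host, n in count_close_wait(SAMPLE).most_common():
        print(host, n)
```

If one remote host dominates the result, that would point to the client whose closed connections the broker is not cleaning up. On a live host the input could come from the actual {{netstat -tan}} output instead of the sample string.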