GeoffreyStark created KAFKA-12665:
-------------------------------------

             Summary: One of the brokers, which is also the controller, has too many 
CLOSE_WAIT connections
                 Key: KAFKA-12665
                 URL: https://issues.apache.org/jira/browse/KAFKA-12665
             Project: Kafka
          Issue Type: Bug
          Components: clients, consumer, controller, core
    Affects Versions: 0.11.0.1
            Reporter: GeoffreyStark
         Attachments: image-2021-04-14-10-32-54-140.png, 
image-2021-04-14-10-39-02-996.png, image-2021-04-14-11-26-03-346.png

# *environment*

Apache Kafka 0.11.0.1

5 nodes

replication factor: 3

mean message rate: 4k messages per second

monitoring: Prometheus, JMX, and Grafana

consumers: Spring Boot and Doris Routine Load

producers: Spring Boot and Log

 

# *problem encountered*

Broker 4, which is also the controller (epoch 90), accumulated a large number of connections in the CLOSE_WAIT state at one point.

From controller.log:

 
{code:java}
Controller 4 epoch 90 fails to send request (type: UpdateMetadataRequest ...
java.io.IOException: Connection to 4 was disconnected before the response was read
{code}
 

!image-2021-04-14-10-32-54-140.png!

The request is retried many times, but the same warning keeps appearing.

 

At the same time, broker 6, which fetches messages from broker 4, hit the same problem:
{code:java}
[2021-04-13 16:35:06,942] WARN [ReplicaFetcherThread-0-4]: Error in fetch to broker 4, request (type=FetchRequest, replicaId=6, maxWait=500, minBytes=1, maxBytes=10485760,
java.io.IOException: Connection to 4 was disconnected before the response was read
{code}
!image-2021-04-14-10-39-02-996.png!

 

The Doris Routine Load job (which consumes from Kafka) timed out:

 
{code:java}
2021-04-13 16:35:11,397 WARN (Routine load scheduler|42) [KafkaUtil.getAllKafkaPartitions():91] failed to get partitions.
org.apache.doris.common.UserException: errCode = 2, detailMessage = failed to get kafka partition info: [failed to get partition meta: Local: Timed out]
{code}
 

 

fs.file on broker 4 (controller, epoch 90):

!image-2021-04-14-11-26-03-346.png!

Most of the CLOSE_WAIT connections come from the consumer application.
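
One way to check where the CLOSE_WAIT sockets come from is to group them by peer address. The following is only a minimal sketch, assuming the input is the text output of `ss -tan` captured on broker 4. The function name `close_wait_by_peer` and the sample addresses are hypothetical and are not taken from the affected cluster.

```python
from collections import Counter


def close_wait_by_peer(ss_output: str) -> Counter:
    """Count CLOSE-WAIT sockets per remote address from `ss -tan` text output."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed column layout: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or fields[0] != "CLOSE-WAIT":
            continue
        # Drop the port, keep the remote address.
        peer_ip = fields[4].rsplit(":", 1)[0]
        counts[peer_ip] += 1
    return counts


# Hypothetical sample, for illustration only (not real data from the cluster).
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB      0      0      10.0.0.4:9092       10.0.0.6:51234
CLOSE-WAIT 1      0      10.0.0.4:9092       10.0.0.21:40100
CLOSE-WAIT 1      0      10.0.0.4:9092       10.0.0.21:40102
CLOSE-WAIT 1      0      10.0.0.4:9092       10.0.0.22:38800
"""

if __name__ == "__main__":
    # Peers holding the most CLOSE-WAIT sockets are listed first.
    for peer, n in close_wait_by_peer(SAMPLE).most_common():
        print(peer, n)
```

Comparing the top peers against the hosts running the consumers would confirm or rule out the consumer application as the main source.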

At 16:49 the broker was restarted and returned to normal.

 

 

# *speculation*

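The TCP connections appear to have been closed by the remote side (a passive close from the broker's point of view), but the broker process on the controller machine did not respond and never closed its own end, so the sockets stayed in CLOSE_WAIT.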

Is there a known bug in this version that could cause this?

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
