David Hoffman created KAFKA-13388:
-------------------------------------

             Summary: Kafka Producer has no timeout for nodes stuck in CHECKING_API_VERSIONS
                 Key: KAFKA-13388
                 URL: https://issues.apache.org/jira/browse/KAFKA-13388
             Project: Kafka
          Issue Type: Bug
          Components: core
            Reporter: David Hoffman

I have been seeing expired batch errors in my app.
{code:java}
org.apache.kafka.common.errors.TimeoutException: Expiring 51 record(s) for xxx-17:120002 ms has passed since batch creation
{code}
I would have expected a request timeout or connection timeout to be logged as well, but I could not find any other associated errors. I added some instrumentation to my app and traced the problem to broker connections hanging in the CHECKING_API_VERSIONS state.

It appears there is no effective timeout for Kafka Producer broker connections in the CHECKING_API_VERSIONS state. In the code, after the NetworkClient connects to a broker node, it sends a request to check API versions; when it receives the response, it marks the node as ready. I am seeing that sometimes no reply is received for the check API versions request, and the connection just hangs in the CHECKING_API_VERSIONS state until it is disposed of, I assume after the idle connection timeout.

I would have guessed that the connection setup timeout should still be in play here, but it is not. There is a connectingNodes set that is consulted when checking timeouts, and the node is removed from it when ClusterConnectionStates.checkingApiVersions(String id) is called to transition the node into CHECKING_API_VERSIONS, so the node is no longer covered by that timeout check.
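
To make the suspected gap concrete, here is a minimal, self-contained sketch of the state tracking described above. It is a simplified model written for illustration, not the Kafka source: the class ConnectionTimeoutSketch, the method nodesWithSetupTimeout, and the 10-second setup timeout are all assumptions. Only the connectingNodes set, the checkingApiVersions(String id) transition, and the CHECKING_API_VERSIONS state name come from the report.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of the behaviour described in this report.
// It is NOT the Kafka implementation; names other than connectingNodes and
// checkingApiVersions are illustrative only.
class ConnectionTimeoutSketch {
    enum State { CONNECTING, CHECKING_API_VERSIONS, READY }

    private final long setupTimeoutMs; // assumed value, chosen by the caller
    private final Map<String, State> states = new HashMap<>();
    private final Map<String, Long> connectStartMs = new HashMap<>();
    // Only nodes in this set are examined by the timeout check below.
    private final Set<String> connectingNodes = new HashSet<>();

    ConnectionTimeoutSketch(long setupTimeoutMs) {
        this.setupTimeoutMs = setupTimeoutMs;
    }

    void connecting(String id, long nowMs) {
        states.put(id, State.CONNECTING);
        connectStartMs.put(id, nowMs);
        connectingNodes.add(id);
    }

    void checkingApiVersions(String id) {
        states.put(id, State.CHECKING_API_VERSIONS);
        // The node leaves timeout tracking here, before any reply has arrived.
        connectingNodes.remove(id);
    }

    void ready(String id) {
        states.put(id, State.READY);
        connectingNodes.remove(id);
    }

    // Returns the tracked nodes whose connection setup has exceeded the timeout.
    List<String> nodesWithSetupTimeout(long nowMs) {
        List<String> timedOut = new ArrayList<>();
        for (String id : connectingNodes) {
            if (nowMs - connectStartMs.get(id) > setupTimeoutMs) {
                timedOut.add(id);
            }
        }
        return timedOut;
    }

    State state(String id) {
        return states.get(id);
    }

    public static void main(String[] args) {
        ConnectionTimeoutSketch s = new ConnectionTimeoutSketch(10_000L);
        s.connecting("broker-1", 0L);
        s.connecting("broker-2", 0L);
        // The API versions request is sent to broker-2, but no reply ever arrives.
        s.checkingApiVersions("broker-2");

        // Two minutes later only the node still in CONNECTING is reported.
        System.out.println(s.nodesWithSetupTimeout(120_000L)); // prints [broker-1]
        System.out.println(s.state("broker-2")); // prints CHECKING_API_VERSIONS
    }
}
```

In this model, broker-2 is never reported as timed out even though it has been waiting far longer than the setup timeout, which matches the hang I am observing. One possible direction, offered only as a suggestion, would be to keep the node under some timeout until the API versions response arrives.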