Cameron created KAFKA-15766:
-------------------------------

             Summary: Possible request handler thread exhaustion in controller election
                 Key: KAFKA-15766
                 URL: https://issues.apache.org/jira/browse/KAFKA-15766
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 3.5.0
            Reporter: Cameron
Hello,

This is my first dive into the Kafka source code, so I may be misinterpreting some of the logic outlined below. Please let me know if my understanding is not correct.

---

After upgrading a relatively large Kafka cluster from IBP 2.5 to 3.5 (Kafka version v3.5.0), we've had numerous problems with controller failover and subsequent elections not being properly acknowledged by a subset of the cluster's brokers. The affected brokers temporarily become unresponsive due to complete request handler thread starvation caused by failed ALLOCATE_PRODUCER_IDS and ALTER_PARTITION requests to the old, stopped controller. In the case of a disconnected response, as seen here, the broker-to-controller channel places these failed requests back at the front of the queue, effectively deadlocking the request handler thread pool. Depending on how quickly the request queue (bounded by queued.max.requests) fills, the new controller's UPDATE_METADATA request, which would notify the broker of the election result, may not even reach the affected broker to update the controller id.

For context, the relevant configuration of my cluster is:
* num.network.threads = 16
* num.io.threads = 8
* queued.max.requests = 500

I am not able to consistently reproduce the issue, but it happens frequently enough to cause major pain in our setup. The general order of events is:
# A controller election takes place (due to an intentional failover or otherwise).
# The new controller successfully registers its znode with ZooKeeper and initiates UPDATE_METADATA requests to notify the other brokers in the cluster.
# The problematic broker, having noticed that it has lost its connection to the old controller but not yet having received and/or processed the incoming UPDATE_METADATA request, repeatedly retries ALLOCATE_PRODUCER_IDS and ALTER_PARTITION requests against the old controller every ~50ms (as hardcoded), receiving disconnect responses and placing each failed request back at the front of the request queue. (A simplified sketch of this loop is included at the end of this description.)

In this scenario the needed UPDATE_METADATA request takes much longer to be processed. In our cluster this sometimes auto-resolves within 3-5 minutes, while at other times it requires manual intervention to bounce the broker and allow it to initialize a connection with the new controller.

This issue seems very similar to some previous tickets I found:
* https://issues.apache.org/jira/browse/KAFKA-7697
* https://issues.apache.org/jira/browse/KAFKA-14694

Unfortunately I have not been able to grab a thread dump because of the sporadic nature of the issue, but I will update the ticket if I am able to.

This may also affect other versions of Kafka, but we've only experienced it with v3.5.0 + IBP 3.5. Curiously enough, we haven't run into this issue with v3.5.0 + IBP 2.5, presumably because ALTER_PARTITION and ALLOCATE_PRODUCER_IDS are not enabled until IBP 2.7 and 3.0, respectively.

Is there any guidance on whether this failure path is possible as described and what steps can be taken to mitigate the issue? Thank you!
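For illustration only, here is a minimal, self-contained sketch of the requeue-at-the-front behavior I believe is starving the handler threads. This is not the actual Kafka code and every class and variable name below is made up; it only models the ordering I described: a request that fails with a disconnect goes back to the head of the queue and is retried every ~50ms against the unreachable old controller, so nothing queued behind it makes progress.

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Illustrative sketch only -- NOT Kafka code; all names are made up.
 * It models the suspected failure loop: a request that gets a disconnect
 * response is re-enqueued at the FRONT of the queue and retried (~50ms
 * backoff) against the old, unreachable controller, so everything queued
 * behind it starves.
 */
public class RequeueAtHeadSketch {

    public static void main(String[] args) throws InterruptedException {
        Deque<String> queue = new ArrayDeque<>();
        queue.add("ALLOCATE_PRODUCER_IDS");
        queue.add("ALTER_PARTITION");
        queue.add("SOME_OTHER_REQUEST");        // never reached below

        boolean oldControllerReachable = false; // the old controller has been stopped

        // A single handler thread draining the queue (bounded to 20 iterations
        // here just so the sketch terminates).
        for (int attempt = 0; attempt < 20 && !queue.isEmpty(); attempt++) {
            String api = queue.pollFirst();
            if (!oldControllerReachable) {
                // Disconnect response: back off ~50ms, then put the request
                // back at the HEAD of the queue, ahead of everything else.
                Thread.sleep(50);
                queue.addFirst(api);
                System.out.println("attempt " + attempt + ": retrying " + api
                        + " against the old controller");
            } else {
                System.out.println("processed " + api);
            }
        }
        // Output: ALLOCATE_PRODUCER_IDS is retried on every iteration, while
        // ALTER_PARTITION and SOME_OTHER_REQUEST never make progress.
    }
}
{code}

Again, this is only meant to make the suspected ordering concrete; the real broker-to-controller channel and request handler pool are obviously more involved, and I may be missing context.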