Daniel Fonai created KAFKA-18874:
------------------------------------

             Summary: KRaft controller does not retry registration if the first attempt times out
                 Key: KAFKA-18874
                 URL: https://issues.apache.org/jira/browse/KAFKA-18874
             Project: Kafka
          Issue Type: Bug
            Reporter: Daniel Fonai

KRaft controller registration has a built-in [retry mechanism|https://github.com/apache/kafka/blob/3.9.0/core/src/main/scala/kafka/server/ControllerRegistrationManager.scala#L274] with exponential backoff. For KRaft controllers, the timeout of the first attempt is 5 s ([code|https://github.com/apache/kafka/blob/3.9.0/core/src/main/scala/kafka/server/ControllerServer.scala#L448]) and is not configurable.

If the controller's first registration request times out for any reason, the attempt should be retried. In practice the retry never happens, and the controller is unable to join the quorum.

We see the following in the faulty controller's log:

{noformat}
2025-02-21 13:31:46,606 INFO [ControllerRegistrationManager id=3 incarnation=mEzjHheAQ_eDWejAFquGiw] sendControllerRegistration: attempting to send ControllerRegistrationRequestData(controllerId=3, incarnationId=mEzjHheAQ_eDWejAFquGiw, zkMigrationReady=true, listeners=[Listener(name='CONTROLPLANE-9090', host='kraft-rollback-kafka-controller-pool-3.kraft-rollback-kafka-kafka-brokers.csm-op-test-kraft-rollback-631e64ac.svc', port=9090, securityProtocol=1)], features=[Feature(name='kraft.version', minSupportedVersion=0, maxSupportedVersion=1), Feature(name='metadata.version', minSupportedVersion=1, maxSupportedVersion=21)]) (kafka.server.ControllerRegistrationManager) [controller-3-registration-manager-event-handler]
...
2025-02-21 13:31:51,627 ERROR [ControllerRegistrationManager id=3 incarnation=mEzjHheAQ_eDWejAFquGiw] RegistrationResponseHandler: channel manager timed out before sending the request. (kafka.server.ControllerRegistrationManager) [controller-3-to-controller-registration-channel-manager]
2025-02-21 13:31:51,726 INFO [ControllerRegistrationManager id=3 incarnation=mEzjHheAQ_eDWejAFquGiw] maybeSendControllerRegistration: waiting for the previous RPC to complete. (kafka.server.ControllerRegistrationManager) [controller-3-registration-manager-event-handler]
{noformat}

The timeout is logged about 5 s after the first attempt. After the final "waiting for the previous RPC to complete" message, the log shows no further registration attempts by the controller.
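
One possible reading of the log above, offered as an assumption rather than a confirmed diagnosis: the manager still considers the timed-out RPC to be in flight, so every later send attempt is skipped. The sketch below is a simplified, hypothetical model of that behaviour and is not the actual Kafka code. The class {{RegistrationModel}}, the flag {{clearPendingOnTimeout}}, and the method names are all made up for illustration; only the quoted log phrases come from this report.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of a registration loop; NOT Kafka's actual code. */
class RegistrationModel {
    // Assumption: toggles whether a timeout releases the "RPC in flight" marker.
    private final boolean clearPendingOnTimeout;
    private boolean pendingRpc = false;
    private int failedAttempts = 0;
    final List<String> log = new ArrayList<>();

    RegistrationModel(boolean clearPendingOnTimeout) {
        this.clearPendingOnTimeout = clearPendingOnTimeout;
    }

    /** Sends a request unless one is believed to be in flight; returns true if sent. */
    boolean maybeSend() {
        if (pendingRpc) {
            log.add("waiting for the previous RPC to complete");
            return false;
        }
        pendingRpc = true;
        log.add("attempting to send registration");
        return true;
    }

    /** Invoked when the in-flight request times out. */
    void onTimeout() {
        failedAttempts++;
        log.add("timed out before sending the request");
        if (clearPendingOnTimeout) {
            pendingRpc = false; // without this, maybeSend() is skipped forever
        }
    }

    /** Exponential backoff: initialMs * 2^(failedAttempts - 1), capped at maxMs. */
    long nextDelayMs(long initialMs, long maxMs) {
        int exp = Math.max(0, Math.min(failedAttempts - 1, 20));
        return Math.min(maxMs, initialMs << exp);
    }

    public static void main(String[] args) {
        for (boolean clear : new boolean[] {false, true}) {
            RegistrationModel m = new RegistrationModel(clear);
            m.maybeSend();                  // first attempt
            m.onTimeout();                  // first attempt times out
            boolean retried = m.maybeSend(); // scheduled retry
            System.out.println("clearPendingOnTimeout=" + clear + " retried=" + retried);
        }
    }
}
```

With {{clearPendingOnTimeout=false}} the model reproduces the message sequence seen in the log (attempt, timeout, "waiting for the previous RPC to complete") and never retries; with {{true}} the second call sends again. Whether the real {{ControllerRegistrationManager}} behaves this way needs to be checked against the linked source.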