[ https://issues.apache.org/jira/browse/KAFKA-18874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939589#comment-17939589 ]
Daniel Fonai commented on KAFKA-18874:
--------------------------------------

I think we might have found the root cause. There is a flag in ControllerRegistrationManager (pendingRpc) which is set when the controller registration request is queued. If the request is not sent before the timeout occurs, the flag is not reset and the request is not rescheduled:
[https://github.com/apache/kafka/blob/c095faa5783b3b2f37f0be590eca7a3ab9aabc99/core/src/main/scala/kafka/server/ControllerRegistrationManager.scala#L211-L214]

I created a PR to reset the flag on timeout: [https://github.com/apache/kafka/pull/19321].

> KRaft controller does not retry registration if the first attempt times out
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-18874
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18874
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 3.9.0
>            Reporter: Daniel Fonai
>            Priority: Minor
>         Attachments: controller-3.log, controller-4.log, controller-5.log
>
>
> There is a [retry mechanism|https://github.com/apache/kafka/blob/3.9.0/core/src/main/scala/kafka/server/ControllerRegistrationManager.scala#L274] with exponential backoff built into KRaft controller registration. The timeout of the first attempt is 5 s for KRaft controllers ([code|https://github.com/apache/kafka/blob/3.9.0/core/src/main/scala/kafka/server/ControllerServer.scala#L448]), and it is not configurable.
> If for some reason the controller's first registration request times out, the attempt should be retried, but in practice this does not happen and the controller is not able to join the quorum.
> We see the following in the faulty controller's log:
> {noformat}
> 2025-02-21 13:31:46,606 INFO [ControllerRegistrationManager id=3 incarnation=mEzjHheAQ_eDWejAFquGiw] sendControllerRegistration: attempting to send ControllerRegistrationRequestData(controllerId=3, incarnationId=mEzjHheAQ_eDWejAFquGiw, zkMigrationReady=true, listeners=[Listener(name='CONTROLPLANE-9090', host='kraft-rollback-kafka-controller-pool-3.kraft-rollback-kafka-kafka-brokers.csm-op-test-kraft-rollback-631e64ac.svc', port=9090, securityProtocol=1)], features=[Feature(name='kraft.version', minSupportedVersion=0, maxSupportedVersion=1), Feature(name='metadata.version', minSupportedVersion=1, maxSupportedVersion=21)]) (kafka.server.ControllerRegistrationManager) [controller-3-registration-manager-event-handler]
> ...
> 2025-02-21 13:31:51,627 ERROR [ControllerRegistrationManager id=3 incarnation=mEzjHheAQ_eDWejAFquGiw] RegistrationResponseHandler: channel manager timed out before sending the request. (kafka.server.ControllerRegistrationManager) [controller-3-to-controller-registration-channel-manager]
> 2025-02-21 13:31:51,726 INFO [ControllerRegistrationManager id=3 incarnation=mEzjHheAQ_eDWejAFquGiw] maybeSendControllerRegistration: waiting for the previous RPC to complete. (kafka.server.ControllerRegistrationManager) [controller-3-registration-manager-event-handler]
> {noformat}
> After this we cannot see any controller retry in the log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
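
For illustration, the failure mode described in the comment can be reproduced in isolation with a small Java model. This is only a sketch under assumptions: the class RegistrationRetrySketch, its methods, and the resetOnTimeout switch are hypothetical and do not match the real ControllerRegistrationManager code; only the pendingRpc flag name and the "waiting for the previous RPC to complete" behaviour are taken from the report.

```java
// Hypothetical model of the reported behaviour; not the actual Kafka implementation.
final class RegistrationRetrySketch {
    private final boolean resetOnTimeout; // true models the idea of resetting the flag on timeout
    private boolean pendingRpc = false;   // set when a registration request is queued
    private int requestsQueued = 0;

    RegistrationRetrySketch(boolean resetOnTimeout) {
        this.resetOnTimeout = resetOnTimeout;
    }

    /** Queues a registration request unless one is already marked as pending. */
    boolean maybeSendRegistration() {
        if (pendingRpc) {
            // Mirrors the "waiting for the previous RPC to complete" log line.
            return false;
        }
        pendingRpc = true;
        requestsQueued++;
        return true;
    }

    /** Models the channel manager timing out before the request was sent. */
    void onTimeout() {
        if (resetOnTimeout) {
            pendingRpc = false; // allows the next attempt to queue a new request
        }
        // Without the reset, pendingRpc stays true and every later attempt is skipped.
    }

    int requestsQueued() {
        return requestsQueued;
    }

    public static void main(String[] args) {
        for (boolean fixed : new boolean[] {false, true}) {
            RegistrationRetrySketch manager = new RegistrationRetrySketch(fixed);
            manager.maybeSendRegistration(); // first attempt is queued
            manager.onTimeout();             // request times out before being sent
            boolean retried = manager.maybeSendRegistration();
            System.out.println("resetOnTimeout=" + fixed + " retried=" + retried);
        }
    }
}
```

In this model the second attempt is only queued when the flag is cleared on timeout, which matches the symptom in the log: after the timeout, the manager keeps reporting that it is waiting for the previous RPC.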