[ https://issues.apache.org/jira/browse/KAFKA-15844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
José Armando García Sancio updated KAFKA-15844:
-----------------------------------------------
    Summary: Broker doesn't re-register after losing ZK session  (was: Broker does doesn't re-register after losing ZK session)

> Broker doesn't re-register after losing ZK session
> --------------------------------------------------
>
>                 Key: KAFKA-15844
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15844
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: José Armando García Sancio
>            Priority: Major
>
> We experienced a case where a Kafka broker lost its connection to the ZK
> cluster and was not able to recreate its registration znode. Only after the
> broker was restarted did the registration znode get created.
>
> My impression is that the following code is not correct. It assumes that the
> ZK client is connected right after the ZooKeeper client is created. It doesn't
> wait for the session state to be marked as connected.
> {code:java}
>   private def reinitialize(): Unit = {
>     // Initialization callbacks are invoked outside of the lock to avoid deadlock potential since their completion
>     // may require additional Zookeeper requests, which will block to acquire the initialization lock
>     stateChangeHandlers.values.foreach(callBeforeInitializingSession _)
>
>     inWriteLock(initializationLock) {
>       if (!connectionState.isAlive) {
>         zooKeeper.close()
>         info(s"Initializing a new session to $connectString.")
>         // retry forever until ZooKeeper can be instantiated
>         var connected = false
>         while (!connected) {
>           try {
>             zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, ZooKeeperClientWatcher, clientConfig)
>             connected = true
>           } catch {
>             case e: Exception =>
>               info("Error when recreating ZooKeeper, retrying after a short sleep", e)
>               Thread.sleep(RetryBackoffMs)
>           }
>         }
>       }
>     }
>
>     stateChangeHandlers.values.foreach(callAfterInitializingSession _)
>   }
> {code}
> During broker startup, or whenever a {{ZooKeeperClient}} is constructed, the
> client blocks until the connection state is marked as connected.
>
> Here is an example session where this happened. The controller sees the
> broker go offline:
> {code:java}
> INFO [Controller id=32] Newly added brokers: , deleted brokers: 37, bounced brokers: , all live brokers: ...
> {code}
> The ZK session is lost in broker 37:
> {code:java}
> [Broker=37] WARN Client session timed out, have not heard from server in 3026ms for sessionid 0x504b9c08b5e0025
> ...
> INFO [ZooKeeperClient ACL authorizer] Session expired.
> ...
> INFO [ZooKeeperClient ACL authorizer] Initializing a new session to ...
> ...
> [Broker=37] INFO Session establishment complete on server ..., sessionid = 0x604dd0ad7180045, negotiated timeout = 18000
> {code}
> Unfortunately, the broker never recreates its registration znode. The
> following line never appears in its logs:
> {code:java}
> Creating $path (is it secure? $isSecure)
> {code}
> My best guess is that some of the Kafka threads (for example, the controller
> threads) are blocked on the ZK client. Unfortunately, I don't have a thread
> dump of the process from the time of the issue.
>
> Restarting broker 37 resolved the issue.
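
To illustrate the wait that the description says is missing from the session re-initialization path, here is a minimal sketch of a "block until connected" gate. It is not the actual Kafka code or fix. The names SessionGate, State, onStateChange and awaitConnected are hypothetical, and the sketch uses only the Java standard library. In a real client, the state updates would presumably come from the ZooKeeper watcher callback.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper: tracks a session state and lets callers block
// until the session is reported as connected.
final class SessionGate {
    // Hypothetical, simplified stand-in for the client's session states.
    enum State { CONNECTING, CONNECTED, CLOSED }

    private final ReentrantLock lock = new ReentrantLock();
    private final Condition stateChanged = lock.newCondition();
    private State state = State.CONNECTING;

    // Assumed to be called from the watcher thread on every state change.
    void onStateChange(State newState) {
        lock.lock();
        try {
            state = newState;
            stateChanged.signalAll(); // wake up every thread in awaitConnected
        } finally {
            lock.unlock();
        }
    }

    // Returns true once the state is CONNECTED. Returns false if the timeout
    // elapses, the session is CLOSED, or the waiting thread is interrupted.
    boolean awaitConnected(long timeout, TimeUnit unit) {
        long nanosLeft = unit.toNanos(timeout);
        lock.lock();
        try {
            while (state != State.CONNECTED) {
                if (state == State.CLOSED || nanosLeft <= 0L) {
                    return false;
                }
                // awaitNanos returns the remaining wait time, which also
                // guards against spurious wakeups extending the timeout.
                nanosLeft = stateChanged.awaitNanos(nanosLeft);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

Under this assumption, the retry loop in reinitialize() would treat the new session as usable only after a call such as awaitConnected(timeout, unit) returns true, instead of as soon as the client constructor returns.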