[ 
https://issues.apache.org/jira/browse/KAFKA-15844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

José Armando García Sancio updated KAFKA-15844:
-----------------------------------------------
    Summary: Broker doesn't re-register after losing ZK session  (was: 
Broker does re-register)

> Broker doesn't re-register after losing ZK session
> --------------------------------------------------
>
>                 Key: KAFKA-15844
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15844
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: José Armando García Sancio
>            Priority: Major
>
> We experienced a case where a Kafka broker lost its connection to the ZK 
> cluster and was not able to recreate the registration znode. Only after the 
> broker was restarted was the registration znode created.
> My impression is that the following code is not correct. It assumes that 
> the ZK client is connected right after the ZooKeeper client is created. It 
> doesn't wait for the session state to be marked as connected.
> {code:java}
>      private def reinitialize(): Unit = {
>       // Initialization callbacks are invoked outside of the lock to avoid deadlock potential since their completion
>       // may require additional Zookeeper requests, which will block to acquire the initialization lock
>       stateChangeHandlers.values.foreach(callBeforeInitializingSession _)
>       inWriteLock(initializationLock) {
>         if (!connectionState.isAlive) {
>           zooKeeper.close()
>           info(s"Initializing a new session to $connectString.")
>           // retry forever until ZooKeeper can be instantiated
>           var connected = false
>           while (!connected) {
>             try {
>               zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, ZooKeeperClientWatcher, clientConfig)
>               connected = true
>             } catch {
>               case e: Exception =>
>                 info("Error when recreating ZooKeeper, retrying after a short sleep", e)
>                 Thread.sleep(RetryBackoffMs)
>             }
>           }
>         }
>       }
>       stateChangeHandlers.values.foreach(callAfterInitializingSession _)
>     }
> {code}
> By contrast, during broker startup, the construction of the 
> {{ZooKeeperClient}} blocks until the connection state is marked as connected.
> Here is an example session where this happened. The controller sees the 
> broker go offline:
> {code:java}
> INFO [Controller id=32] Newly added brokers: , deleted brokers: 37, bounced brokers: , all live brokers: ...
> {code}
> ZK session is lost in broker 37:
> {code:java}
> [Broker=37] WARN Client session timed out, have not heard from server in 3026ms for sessionid 0x504b9c08b5e0025
> ...
> INFO [ZooKeeperClient ACL authorizer] Session expired.
> ...
> INFO [ZooKeeperClient ACL authorizer] Initializing a new session to ...
> ...
> [Broker=37] INFO Session establishment complete on server ..., sessionid = 0x604dd0ad7180045, negotiated timeout = 18000
> {code}
> Unfortunately, we never see the broker recreate the broker registration 
> znode. We never see the following line in the logs:
> {code:java}
> Creating $path (is it secure? $isSecure){code}
> My best guess is that some of the Kafka threads (for example, the controller 
> threads) are blocked on the ZK client. Unfortunately, I don't have a thread 
> dump of the process from the time of the issue.
> Restarting broker 37 resolved the issue.
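
The report contrasts two behaviours: startup blocks until the session state is marked as connected, while reinitialize() treats a successful constructor call as "connected". A minimal, dependency-free sketch of the missing wait is shown below. It is not the Kafka implementation and not a proposed patch: SessionState, ConnectionGate, ReinitializeSketch and the createClient parameter are all hypothetical names standing in for the real ZooKeeper client and its watcher.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

// Hypothetical stand-in for the session states a ZooKeeper watcher reports.
sealed trait SessionState
case object Connecting extends SessionState
case object Connected extends SessionState

// Hypothetical helper: lets a caller block until a watcher callback has
// reported the Connected state.
final class ConnectionGate {
  private val connectedLatch = new CountDownLatch(1)

  // Intended to be called from the client's watcher on each state change.
  def onStateChange(state: SessionState): Unit =
    if (state == Connected) connectedLatch.countDown()

  // Returns true if Connected was observed within the timeout, false otherwise.
  def awaitConnected(timeout: Long, unit: TimeUnit): Boolean =
    connectedLatch.await(timeout, unit)
}

object ReinitializeSketch {
  // Creates the client, then waits for the connected state instead of
  // assuming it. `createClient` is a placeholder (an assumption) for
  // constructing the real client and wiring its watcher to the gate.
  def reinitialize(createClient: ConnectionGate => Unit, timeoutMs: Long): Boolean = {
    val gate = new ConnectionGate
    createClient(gate)
    gate.awaitConnected(timeoutMs, TimeUnit.MILLISECONDS)
  }
}
```

In this sketch the caller learns from the return value whether the session actually reached the connected state, so callbacks that recreate the registration znode could be deferred or retried when it did not.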



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
