[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697002#comment-14697002 ]
Flavio Junqueira commented on KAFKA-1387: ----------------------------------------- I'm actually really sorry that this issue has been around for so long, I didn't realize it was going on and that I was even indirectly participating in it. Let me start by giving a sort of general overview of what to expect. If a client has received a session expiration event, it means that the leader has expired the session and has broadcast the closeSession event to the followers. If the same client creates a new session successfully, then the server it connects to must have applied the previous closeSession, which deletes the ephemeral znodes, because ZK guarantees that txns are totally ordered. Consequently, the client shouldn't observe an ephemeral from an old session of its own. Note that another client could still observe the ephemeral znode after the session expiration if it is connected to a server that is a bit behind, but that's fine. What I'm thinking is that one problem that could happen is that a client creates a new session before receiving the session expiration for an earlier session. In that case the ephemerals will still be there because the session still exists. The bottom line is that if the client has seen the session expiration event, then it seems fine to go ahead and create new ephemerals without having to check whether ephemerals are stale or not. If the session creation isn't clean, then there are a few options like waiting for the timeout period, storing and recovering the session id. I'll dig into the code to see how we can fix this, have a closer look at the patch, and will reopen the associated ZOOKEEPER-1740 issue until we sort this out. let me know if the explanation above makes sense in the meanwhile. > Kafka getting stuck creating ephemeral node it has already created when two > zookeeper sessions are established in a very short period of time > --------------------------------------------------------------------------------------------------------------------------------------------- > > Key: KAFKA-1387 > URL: https://issues.apache.org/jira/browse/KAFKA-1387 > Project: Kafka > Issue Type: Bug > Affects Versions: 0.8.1.1 > Reporter: Fedor Korotkiy > Priority: Blocker > Labels: newbie, patch, zkclient-problems > Attachments: kafka-1387.patch > > > Kafka broker re-registers itself in zookeeper every time handleNewSession() > callback is invoked. > https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/server/KafkaHealthcheck.scala > > Now imagine the following sequence of events. > 1) Zookeeper session reestablishes. handleNewSession() callback is queued by > the zkClient, but not invoked yet. > 2) Zookeeper session reestablishes again, queueing callback second time. > 3) First callback is invoked, creating /broker/[id] ephemeral path. > 4) Second callback is invoked and it tries to create /broker/[id] path using > createEphemeralPathExpectConflictHandleZKBug() function. But the path is > already exists, so createEphemeralPathExpectConflictHandleZKBug() is getting > stuck in the infinite loop. > Seems like controller election code have the same issue. > I'am able to reproduce this issue on the 0.8.1 branch from github using the > following configs. > # zookeeper > tickTime=10 > dataDir=/tmp/zk/ > clientPort=2101 > maxClientCnxns=0 > # kafka > broker.id=1 > log.dir=/tmp/kafka > zookeeper.connect=localhost:2101 > zookeeper.connection.timeout.ms=100 > zookeeper.sessiontimeout.ms=100 > Just start kafka and zookeeper and then pause zookeeper several times using > Ctrl-Z. -- This message was sent by Atlassian JIRA (v6.3.4#6332)