[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

Flavio Junqueira (JIRA) Fri, 14 Aug 2015 06:35:38 -0700

    [ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697002#comment-14697002 ]


Flavio Junqueira commented on KAFKA-1387:
-----------------------------------------

I'm really sorry that this issue has been around for so long. I didn't 
realize it was going on, or that I was even indirectly participating in it. Let 
me start with a general overview of what to expect.

If a client has received a session expiration event, it means that the leader 
has expired the session and has broadcast the closeSession event to the 
followers. If the same client creates a new session successfully, then the 
server it connects to must have applied the previous closeSession, which 
deletes the ephemeral znodes, because ZK guarantees that txns are totally 
ordered. Consequently, the client shouldn't observe an ephemeral from an old 
session of its own. Note that another client could still observe the ephemeral 
znode after the session expiration if it is connected to a server that is a bit 
behind, but that's fine.

One problem that could happen is that a client creates a new session before 
receiving the session expiration event for an earlier session. In that case 
the ephemerals will still be there, because the earlier session still exists.
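
To make the two cases above concrete, here is a minimal toy model in Python. 
It is not ZooKeeper code: ToyZk and its methods are hypothetical names, and 
the model only assumes what is stated above, namely that applying 
closeSession deletes the session's ephemerals and that a new session created 
afterwards cannot see them. The /broker/1 path follows the issue description.

```python
# Toy in-memory model of sessions and ephemeral znodes.
# Hypothetical names throughout; this is an illustration, not the ZooKeeper API.

class ToyZk:
    def __init__(self):
        self.next_session_id = 1
        self.live_sessions = set()
        self.ephemerals = {}  # path -> id of the session that owns the znode

    def create_session(self):
        sid = self.next_session_id
        self.next_session_id += 1
        self.live_sessions.add(sid)
        return sid

    def close_session(self, sid):
        # Models the closeSession txn: drop the session and its ephemerals.
        self.live_sessions.discard(sid)
        self.ephemerals = {p: o for p, o in self.ephemerals.items() if o != sid}

    def create_ephemeral(self, path, sid):
        if sid not in self.live_sessions:
            raise RuntimeError("session expired")
        if path in self.ephemerals:
            raise FileExistsError(path)
        self.ephemerals[path] = sid


def clean_reconnect():
    """Old session is expired before the new one is created."""
    zk = ToyZk()
    old = zk.create_session()
    zk.create_ephemeral("/broker/1", old)
    zk.close_session(old)  # expiration is applied first
    new = zk.create_session()
    zk.create_ephemeral("/broker/1", new)  # no conflict
    return zk.ephemerals["/broker/1"] == new


def early_reconnect():
    """New session is created while the old one still exists."""
    zk = ToyZk()
    old = zk.create_session()
    zk.create_ephemeral("/broker/1", old)
    new = zk.create_session()  # old session has not expired yet
    try:
        zk.create_ephemeral("/broker/1", new)
    except FileExistsError:
        # The znode is still there and still owned by the old session.
        return zk.ephemerals["/broker/1"] == old
    return False
```

In this model clean_reconnect() creates the znode without a conflict, while 
early_reconnect() hits the existing znode owned by the earlier session.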

The bottom line is that if the client has seen the session expiration event, 
then it seems fine to go ahead and create new ephemerals without checking 
whether existing ephemerals are stale. If the session creation isn't clean, 
then there are a few options, such as waiting for the timeout period or 
storing and recovering the session id.

I'll dig into the code to see how we can fix this, have a closer look at the 
patch, and reopen the associated ZOOKEEPER-1740 issue until we sort this 
out. In the meantime, let me know if the explanation above makes sense.

> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1387
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1387
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.1.1
>            Reporter: Fedor Korotkiy
>            Priority: Blocker
>              Labels: newbie, patch, zkclient-problems
>         Attachments: kafka-1387.patch
>
>
> Kafka broker re-registers itself in zookeeper every time the 
> handleNewSession() callback is invoked.
> https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/server/KafkaHealthcheck.scala
>  
> Now imagine the following sequence of events.
> 1) Zookeeper session reestablishes. handleNewSession() callback is queued by 
> the zkClient, but not invoked yet.
> 2) Zookeeper session reestablishes again, queueing the callback a second time.
> 3) First callback is invoked, creating the /broker/[id] ephemeral path.
> 4) Second callback is invoked and tries to create the /broker/[id] path using 
> the createEphemeralPathExpectConflictHandleZKBug() function. But the path 
> already exists, so createEphemeralPathExpectConflictHandleZKBug() gets 
> stuck in an infinite loop.
> The controller election code seems to have the same issue.
> I am able to reproduce this issue on the 0.8.1 branch from github using the 
> following configs.
> # zookeeper
> tickTime=10
> dataDir=/tmp/zk/
> clientPort=2101
> maxClientCnxns=0
> # kafka
> broker.id=1
> log.dir=/tmp/kafka
> zookeeper.connect=localhost:2101
> zookeeper.connection.timeout.ms=100
> zookeeper.sessiontimeout.ms=100
> Just start kafka and zookeeper and then pause zookeeper several times using 
> Ctrl-Z.
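
The four-step sequence in the description can be sketched as a small Python 
simulation. This is a hedged illustration, not the Kafka code: 
run_callbacks and max_attempts are hypothetical, and the retry loop only 
models a create that treats an existing znode as stale and waits for it to 
disappear. The loop is bounded here so that the sketch terminates.

```python
from collections import deque


def run_callbacks(max_attempts=5):
    """Replay two queued handleNewSession() callbacks against one live session."""
    ephemerals = {}   # path -> owning session id
    session_id = 2    # the session that is current when the callbacks run
    # Steps 1 and 2: the callback is queued twice before either is invoked.
    queue = deque(["handleNewSession", "handleNewSession"])
    outcomes = []
    while queue:
        queue.popleft()
        for attempt in range(1, max_attempts + 1):
            if "/broker/1" not in ephemerals:
                # Step 3: the first callback creates the ephemeral path.
                ephemerals["/broker/1"] = session_id
                outcomes.append(("created", attempt))
                break
            # Step 4: the path exists, so the create assumes it is stale and
            # retries. The owning session is alive, so nothing deletes it.
        else:
            outcomes.append(("stuck", max_attempts))
    return outcomes
```

Calling run_callbacks() returns [("created", 1), ("stuck", 5)]: the second 
callback exhausts its attempts because the znode it is waiting on belongs to 
the live session. Without the bound, that loop would never exit.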



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
