I see, that makes sense. Thanks a lot for clarifying! -Neha
On Fri, Mar 21, 2014 at 11:01 AM, Bae, Jae Hyeon <metac...@gmail.com> wrote: > Let me clarify the situation. I forgot to mention that my case might not be > general one because Netflix is using Apache Curator as the main zookeeper > client and ZkClient in Kafka should be bridged to Apache Curator, so the > behavior I have seen might not be general one. > > Kafka's ZKSessionExpireListener.handleNewSession() is reinstating all > ephemeral nodes and watchers but handleNewSession() was not kicked in my > case. So, I created Netflix internal version of ZkClient to replace > ephemeral node creation and watcher reinstating. > > I have a plan to remove all external dependency from Kafka soon and I might > be able to mention more about Kafka's zk resiliency. > > > On Fri, Mar 21, 2014 at 7:51 AM, Jun Rao <jun...@gmail.com> wrote: > > > Hi, Jae, > > > > My understanding is that ephemeral nodes can only be lost on ZK session > > expiration. Are you saying that there are other cases when ephemeral > nodes > > can be lost? > > > > Thanks, > > > > Jun > > > > > > On Thu, Mar 20, 2014 at 9:52 PM, Bae, Jae Hyeon <metac...@gmail.com> > > wrote: > > > > > This issue is zookeeper resiliency. > > > > > > What I have done is, ephemeral node creation is replaced by Apache > > > Curator's PersistentEphemeralNode recipe, to reinstate ephemeral nodes > > > after zookeeper blip. Also, all watchers also should be reinstated. > Kafka > > > internally only handles session expired event but there are many bad > > > situations of zookeepers where Kafka can go haywire. > > > > > > > > > > > > > > > On Thu, Mar 20, 2014 at 9:23 PM, Neha Narkhede < > neha.narkh...@gmail.com > > > >wrote: > > > > > > > 1. What does 'zookeeper state changed (Expired)' mean? > > > > > > > > 2. Has anyone seen issues like this before? Where zookeeper > > connections > > > > are flaky enough to cause leader elections? > > > > > > > > It means zookeeper expired the session. The most common reason for > this > > > is > > > > client side GC (in your case, client is the Kafka broker) though if > > your > > > > zookeeper setup is flaky, this might even point to a server side I/O > > > > bottleneck or GC. Could you inspect the broker and zookeeper's gc > logs? > > > > > > > > > > > > On Thu, Mar 20, 2014 at 3:26 PM, Andrew Otto <o...@wikimedia.org> > > wrote: > > > > > > > > > Hm, just saw something a little fishy. > > > > > > > > > > (In the following logs, analytics1021 (id 21) and analytics1022 (id > > 22) > > > > > are Brokers and analytics1023,analytics1024,analytics1025 are > > > > Zookeepers.) > > > > > > > > > > At 2014-03-20 21:12:26, analytics1021 lost its connection to > > zookeeper. > > > > It > > > > > reconnected to analytics1023, but as it was doing so, it logged a > > > > > 'zookeeper state changed (Expired)' message, which apparently > caused > > > the > > > > > newly opened socket to close again. Am I reading that right? > > > > > > > > > > In the time that it took analytics1021 to finally re-establish a > > > > zookeeper > > > > > connection, analytics1022 had become the leader for all partitions. > > I > > > > can > > > > > rebalance the leaders, but I'd prefer if this didn't happen in the > > > first > > > > > place. > > > > > > > > > > 1. What does 'zookeeper state changed (Expired)' mean? > > > > > 2. Has anyone seen issues like this before? Where zookeeper > > > connections > > > > > are flaky enough to cause leader elections? > > > > > > > > > > Thanks! > > > > > -Andrew Otto > > > > > > > > > > > > > > > > > > > > [2014-03-20 21:12:26,427] 685698013 > > > > > [main-SendThread(analytics1025.eqiad.wmnet:2181)] INFO > > > > > org.apache.zookeeper.ClientCnxn - Client session timed out, have > not > > > > heard > > > > > from server in 10067ms for sessionid 0x1425d903a3f1f89, closing > > socket > > > > > connection and attempting reconnect > > > > > [2014-03-20 21:12:26,528] 685698114 [main-EventThread] INFO > > > > > org.I0Itec.zkclient.ZkClient - zookeeper state changed > (Disconnected) > > > > > [2014-03-20 21:12:27,252] 685698838 > > > > > [main-SendThread(analytics1025.eqiad.wmnet:2181)] INFO > > > > > org.apache.zookeeper.ClientCnxn - Opening socket connection to > server > > > > > analytics1023.eqiad.wmnet/10.64.5.15:2181 > > > > > [2014-03-20 21:12:27,271] 685698857 > > > > > [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO > > > > > org.apache.zookeeper.ClientCnxn - Socket connection established to > > > > > analytics1023.eqiad.wmnet/10.64.5.15:2181, initiating session > > > > > [2014-03-20 21:12:27,273] 685698859 [main-EventThread] INFO > > > > > org.I0Itec.zkclient.ZkClient - zookeeper state changed (Expired) > > > > > [2014-03-20 21:12:27,273] 685698859 > > > > > [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO > > > > > org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper > > > > service, > > > > > session 0x1425d903a3f1f89 has expired, closing socket connection > > > > > [2014-03-20 21:12:27,273] 685698859 [main-EventThread] INFO > > > > > org.apache.zookeeper.ZooKeeper - Initiating client connection, > > > > > > > > > > > > > > > connectString=analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad > > > > > sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@63203b59 > > > > > [2014-03-20 21:12:27,290] 685698876 > > > > > > > > > > > > > > > [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad] > > > > > INFO kafka.server.KafkaZooKeeper - re-registering broker info in ZK > > for > > > > > broker 21 > > > > > [2014-03-20 21:12:27,290] 685698876 [main-EventThread] INFO > > > > > org.apache.zookeeper.ClientCnxn - EventThread shut down > > > > > [2014-03-20 21:12:27,290] 685698876 [main-SendThread()] INFO > > > > > org.apache.zookeeper.ClientCnxn - Opening socket connection to > server > > > > > analytics1023.eqiad.wmnet/10.64.5.15:2181 > > > > > [2014-03-20 21:12:27,292] 685698878 > > > > > [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO > > > > > org.apache.zookeeper.ClientCnxn - Socket connection established to > > > > > analytics1023.eqiad.wmnet/10.64.5.15:2181, initiating session > > > > > [2014-03-20 21:12:27,313] 685698899 > > > > > [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO > > > > > org.apache.zookeeper.ClientCnxn - Session establishment complete on > > > > server > > > > > analytics1023.eqiad.wmnet/10.64.5.15:2181, sessionid = > > > > > 0xff429b4201dd2352, negotiated timeout = 6000 > > > > > [2014-03-20 21:12:27,314] 685698900 [main-EventThread] INFO > > > > > org.I0Itec.zkclient.ZkClient - zookeeper state changed > > (SyncConnected) > > > > > [2014-03-20 21:12:27,324] 685698910 > > > > > > > > > > > > > > > [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad] > > > > > INFO kafka.server.KafkaZooKeeper - done re-registering broker > > > > > [2014-03-20 21:12:27,324] 685698910 > > > > > > > > > > > > > > > [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad] > > > > > INFO kafka.server.KafkaZooKeeper - Subscribing to /brokers/topics > > path > > > to > > > > > watch for new topics > > > > > [2014-03-20 21:12:27,370] 685698956 > > > > > > > > > > > > > > > [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad] > > > > > INFO kafka.server.ZookeeperLeaderElector$LeaderChangeListener - New > > > > leader > > > > > is 22 > > > > > > > > > > > > > > > > > > > > > > > > >