While on the subject of zkclient, also consider KAFKA-1793. A more abstract interface to the distributed coordination service that could be configured to use alternatives like consul or etcd would be very useful imho.
Dana

FWIW - the ZkClient project team have merged the pull request that I had
submitted to allow timeouts on operations:
https://github.com/sgroschupf/zkclient/pull/29. I heard from Johannes (from
the ZkClient project team) that they don't have any specific release date in
mind but are willing to release a new version if/when we need one.

-Jaikiran

On Wednesday 04 February 2015 12:33 AM, Gwen Shapira wrote:
> So I think the current plan is:
> 1. Add timeout in zkclient
> 2. Ask zkclient to release new version (we need it for a few other things
>    too)
> 3. Rebase on new zkclient
> 4. Fix this jira and the few others that were waiting for the new zkclient
>
> Does that make sense?
>
> Gwen
>
> On Mon, Feb 2, 2015 at 8:33 PM, Jaikiran Pai <jai.forums2...@gmail.com>
> wrote:
>
>> I just heard back from Stefan, who manages the ZkClient repo, and he seems
>> to be open to having these changes be part of the ZkClient project. I'll
>> be creating a pull request for that project to have it reviewed and
>> merged. Although I haven't heard of exact release plans, Stefan's reply
>> did indicate that the project could be released after this change is
>> merged.
>>
>> -Jaikiran
>>
>> On Tuesday 03 February 2015 09:03 AM, Jaikiran Pai wrote:
>>
>>> Thanks for pointing to that repo!
>>>
>>> I just had a look at it and it appears that the project isn't very
>>> active (going by the lack of activity). The latest contribution is from
>>> Gwen and that was around 3 months back. I haven't found release plans
>>> for that project or a place to ask about it (filing an issue doesn't
>>> seem right to ask this question). So I'll get in touch with the repo
>>> owner and see what his plans for the project are.
>>>
>>> -Jaikiran
>>>
>>> On Monday 02 February 2015 11:33 PM, Gwen Shapira wrote:
>>>
>>>> I did!
>>>>
>>>> Thanks for clarifying :)
>>>>
>>>> The client that is part of Zookeeper itself actually does support
>>>> timeouts.
>>>>
>>>> On Mon, Feb 2, 2015 at 9:54 AM, Guozhang Wang <wangg...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Jaikiran,
>>>>>
>>>>> I think Gwen was talking about contributing to ZkClient project:
>>>>>
>>>>> https://github.com/sgroschupf/zkclient
>>>>>
>>>>> Guozhang
>>>>>
>>>>> On Sun, Feb 1, 2015 at 5:30 AM, Jaikiran Pai
>>>>> <jai.forums2...@gmail.com> wrote:
>>>>>
>>>>>> Hi Gwen,
>>>>>>
>>>>>> Yes, the KafkaZkClient is a wrapper around ZkClient and not a
>>>>>> complete replacement.
>>>>>>
>>>>>> As for contributing to Zookeeper, yes that indeed is on my mind, but
>>>>>> I haven't yet had a chance to really look deeper into Zookeeper or
>>>>>> get in touch with their dev team to try and explain this potential
>>>>>> improvement to them. I have no objection to contributing this or
>>>>>> something similar to Zookeeper directly. I think I should be able to
>>>>>> bring this up in the Zookeeper dev forum, sometime soon in the next
>>>>>> few weekends.
>>>>>>
>>>>>> -Jaikiran
>>>>>>
>>>>>> On Sunday 01 February 2015 11:40 AM, Gwen Shapira wrote:
>>>>>>
>>>>>>> It looks like the new KafkaZkClient is a wrapper around ZkClient,
>>>>>>> but not a replacement. Did I get it right?
>>>>>>>
>>>>>>> I think a wrapper for ZkClient can be useful - for example
>>>>>>> KAFKA-1664 can also use one.
>>>>>>>
>>>>>>> However, I'm wondering why not contribute the fix directly to the
>>>>>>> ZKClient project and ask for a release that contains the fix?
>>>>>>> This will benefit other users of the project who may also need a
>>>>>>> timeout (that's pretty basic...)
>>>>>>>
>>>>>>> As an alternative, if we don't want to collaborate with ZKClient for
>>>>>>> some reason, forking the project into Kafka will probably give us
>>>>>>> more control than wrappers and without much downside.
>>>>>>>
>>>>>>> Just a thought.
>>>>>>>
>>>>>>> Gwen
>>>>>>>
>>>>>>> On Sat, Jan 31, 2015 at 6:32 AM, Jaikiran Pai
>>>>>>> <jai.forums2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Neha, Ewen (and others), my initial attempt to solve this is
>>>>>>>> uploaded here https://reviews.apache.org/r/30477/. It solves the
>>>>>>>> shutdown problem and now the server shuts down even when Zookeeper
>>>>>>>> has gone down before the Kafka server.
>>>>>>>>
>>>>>>>> I went with the approach of introducing a custom (enhanced)
>>>>>>>> ZkClient which for now allows time outs to be optionally specified
>>>>>>>> for certain operations. I intentionally haven't forced the use of
>>>>>>>> this new KafkaZkClient all over the code and instead for now have
>>>>>>>> just used it in the KafkaServer.
>>>>>>>>
>>>>>>>> Does this patch look like something worth using?
>>>>>>>>
>>>>>>>> -Jaikiran
>>>>>>>>
>>>>>>>> On Thursday 29 January 2015 10:41 PM, Neha Narkhede wrote:
>>>>>>>>
>>>>>>>>> Ewen is right. ZkClient APIs are blocking and the right fix for
>>>>>>>>> this seems to be patching ZkClient. At some point, if we find
>>>>>>>>> ourselves fiddling too much with ZkClient, it wouldn't hurt to
>>>>>>>>> write our own little zookeeper client wrapper.
>>>>>>>>>
>>>>>>>>> On Thu, Jan 29, 2015 at 12:57 AM, Ewen Cheslack-Postava
>>>>>>>>> <e...@confluent.io> wrote:
>>>>>>>>>
>>>>>>>>>> Looks like a bug to me -- the underlying ZK library wraps a lot
>>>>>>>>>> of blocking method implementations with waitUntilConnected()
>>>>>>>>>> calls without any timeouts. Ideally we could just add a version
>>>>>>>>>> of ZkUtils.getController() with a timeout, but I don't see an
>>>>>>>>>> easy way to accomplish that with ZkClient.
>>>>>>>>>>
>>>>>>>>>> There's at least one other call to ZkUtils besides the one in the
>>>>>>>>>> stacktrace you gave that would cause the same issue, possibly
>>>>>>>>>> more that aren't directly called in that method. One ugly
>>>>>>>>>> solution would be to use an extra thread during shutdown to
>>>>>>>>>> trigger timeouts, but I'd imagine we probably have other threads
>>>>>>>>>> that could end up blocking in similar ways.
>>>>>>>>>>
>>>>>>>>>> I filed https://issues.apache.org/jira/browse/KAFKA-1907 to track
>>>>>>>>>> the issue.
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 26, 2015 at 6:35 AM, Jaikiran Pai
>>>>>>>>>> <jai.forums2...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The main culprit is this thread which goes into "forever retry
>>>>>>>>>>> connection to a closed zookeeper" when I shutdown Kafka (via a
>>>>>>>>>>> Ctrl + C) after zookeeper has already been shutdown. I have
>>>>>>>>>>> attached the complete thread dump, but I don't know if it will
>>>>>>>>>>> be delivered to the mailing list.
>>>>>>>>>>>
>>>>>>>>>>> "Thread-2" prio=10 tid=0xb3305000 nid=0x4758 waiting on condition [0x6ad69000]
>>>>>>>>>>>    java.lang.Thread.State: TIMED_WAITING (parking)
>>>>>>>>>>>     at sun.misc.Unsafe.park(Native Method)
>>>>>>>>>>>     - parking to wait for <0x70a93368> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>>>>>>>>>>>     at java.util.concurrent.locks.LockSupport.parkUntil(LockSupport.java:267)
>>>>>>>>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUntil(AbstractQueuedSynchronizer.java:2130)
>>>>>>>>>>>     at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:636)
>>>>>>>>>>>     at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:619)
>>>>>>>>>>>     at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:615)
>>>>>>>>>>>     at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:679)
>>>>>>>>>>>     at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:766)
>>>>>>>>>>>     at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761)
>>>>>>>>>>>     at kafka.utils.ZkUtils$.readDataMaybeNull(ZkUtils.scala:456)
>>>>>>>>>>>     at kafka.utils.ZkUtils$.getController(ZkUtils.scala:65)
>>>>>>>>>>>     at kafka.server.KafkaServer.kafka$server$KafkaServer$$controlledShutdown(KafkaServer.scala:194)
>>>>>>>>>>>     at kafka.server.KafkaServer$$anonfun$shutdown$1.apply$mcV$sp(KafkaServer.scala:269)
>>>>>>>>>>>     at kafka.utils.Utils$.swallow(Utils.scala:172)
>>>>>>>>>>>     at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
>>>>>>>>>>>     at kafka.utils.Utils$.swallowWarn(Utils.scala:45)
>>>>>>>>>>>     at kafka.utils.Logging$class.swallow(Logging.scala:94)
>>>>>>>>>>>     at kafka.utils.Utils$.swallow(Utils.scala:45)
>>>>>>>>>>>     at kafka.server.KafkaServer.shutdown(KafkaServer.scala:269)
>>>>>>>>>>>     at kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:42)
>>>>>>>>>>>     at kafka.Kafka$$anon$1.run(Kafka.scala:42)
>>>>>>>>>>>
>>>>>>>>>>> -Jaikiran
>>>>>>>>>>>
>>>>>>>>>>> On Monday 26 January 2015 05:46 AM, Neha Narkhede wrote:
>>>>>>>>>>>
>>>>>>>>>>>> For a clean shutdown, the broker tries to talk to the
>>>>>>>>>>>> controller and also issues reads to zookeeper. Possibly that
>>>>>>>>>>>> is where it tries to reconnect to zk. It will help to look at
>>>>>>>>>>>> the thread dump.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Neha
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jan 23, 2015 at 8:53 PM, Jaikiran Pai
>>>>>>>>>>>> <jai.forums2...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I was just playing around with the RC2 of 0.8.2 and noticed
>>>>>>>>>>>>> that if I shutdown zookeeper first I can't shutdown Kafka
>>>>>>>>>>>>> server at all since it goes into a never ending attempt to
>>>>>>>>>>>>> reconnect with zookeeper. I had to kill the Kafka process to
>>>>>>>>>>>>> stop it. I tried it against trunk too and there too I see the
>>>>>>>>>>>>> same issue. Should I file a JIRA for this and see if I can
>>>>>>>>>>>>> come up with a patch?
>>>>>>>>>>>>>
>>>>>>>>>>>>> FWIW, here's the unending (and IMO too frequent) attempts at
>>>>>>>>>>>>> trying to reconnect. I've a thread dump too which shows that
>>>>>>>>>>>>> the other thread which is trying to complete a controlled
>>>>>>>>>>>>> shutdown of Kafka is blocked forever for the zookeeper to be
>>>>>>>>>>>>> up. I can attach it to the JIRA.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2015-01-24 10:15:46,278] WARN Session 0x14b1a4136800000 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
>>>>>>>>>>>>> java.net.ConnectException: Connection refused
>>>>>>>>>>>>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>>>>>>>>>>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>>>>>>>>>>>>>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>>>>>>>>>>>>>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>>>>>>>>>>>>> [2015-01-24 10:15:47,437] INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
>>>>>>>>>>>>> [2015-01-24 10:15:47,438] WARN Session 0x14b1a4136800000 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
>>>>>>>>>>>>> java.net.ConnectException: Connection refused
>>>>>>>>>>>>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>>>>>>>>>>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>>>>>>>>>>>>>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>>>>>>>>>>>>>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>>>>>>>>>>>>> [2015-01-24 10:15:49,056] INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
>>>>>>>>>>>>> [2015-01-24 10:15:49,057] WARN Session 0x14b1a4136800000 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
>>>>>>>>>>>>> java.net.ConnectException: Connection refused
>>>>>>>>>>>>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>>>>>>>>>>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>>>>>>>>>>>>>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>>>>>>>>>>>>>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>>>>>>>>>>>>> [2015-01-24 10:15:50,801] INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
>>>>>>>>>>>>> [2015-01-24 10:15:50,802] WARN Session 0x14b1a4136800000 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
>>>>>>>>>>>>> java.net.ConnectException: Connection refused
>>>>>>>>>>>>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>>>>>>>>>>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>>>>>>>>>>>>>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>>>>>>>>>>>>>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jaikiran
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks,
>>>>>>>>>> Ewen
>>>>>
>>>>> --
>>>>> -- Guozhang
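
For readers following the thread, the "extra thread during shutdown to trigger timeouts" idea Ewen mentions can be sketched with plain JDK concurrency. This is only an illustration under stated assumptions: `BoundedCall` and `callWithTimeout` are hypothetical names, and this is neither the patch in the review request nor the change merged into ZkClient. Whether the interrupt actually unblocks a stuck call depends on that call responding to thread interruption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration; not part of Kafka or ZkClient. */
final class BoundedCall {
    private BoundedCall() {}

    /**
     * Runs a potentially blocking call on a daemon helper thread and waits
     * at most the given timeout for its result. On timeout the helper
     * thread is interrupted and a TimeoutException is thrown to the caller.
     */
    static <T> T callWithTimeout(Callable<T> blockingCall, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread helper = new Thread(runnable, "bounded-call");
            helper.setDaemon(true); // a stuck helper must not keep the JVM alive
            return helper;
        });
        Future<T> future = executor.submit(blockingCall);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // request interruption of the blocked call
            throw e;
        } finally {
            executor.shutdownNow(); // release the helper thread in every case
        }
    }
}
```

A shutdown path could wrap a blocking read (for example, the controller lookup seen in the stack trace) in `callWithTimeout` and skip the controlled-shutdown step when a `TimeoutException` is thrown. As Ewen notes, this only bounds the one call that is wrapped; other threads blocked in the same way would need the same treatment, which is why a timeout inside the client itself is the cleaner fix.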