Looks like a bug to me -- the underlying ZK library wraps a lot of blocking
method implementations with waitUntilConnected() calls without any
timeouts. Ideally we could just add a version of ZkUtils.getController()
with a timeout, but I don't see an easy way to accomplish that with
ZkClient.

There's at least one other call to ZkUtils besides the one in the
stacktrace you gave that would cause the same issue, possibly more that
aren't directly called in that method. One ugly solution would be to use an
extra thread during shutdown to trigger timeouts, but I'd imagine we
probably have other threads that could end up blocking in similar ways.
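
To make the extra-thread idea concrete, here is a rough sketch of a timeout
wrapper. It is only an illustration: TimedCall and callWithTimeout are
made-up names, not anything in Kafka or ZkClient, and the sleeping lambda in
main() stands in for a blocking call such as ZkUtils.getController(). It
assumes the blocked call responds to thread interruption. I'd expect that to
hold for ZkClient's wait loop, but I haven't verified it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedCall {

    // Hypothetical helper: runs `call` on a separate daemon thread and gives
    // up after the timeout, returning `fallback` instead of blocking forever.
    public static <T> T callWithTimeout(Callable<T> call, long timeout,
                                        TimeUnit unit, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "shutdown-timeout-helper");
            t.setDaemon(true); // never keep the JVM alive because of this thread
            return t;
        });
        Future<T> future = executor.submit(call);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the blocked worker thread
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a ZooKeeper read that would otherwise never return.
        String controller = callWithTimeout(() -> {
            Thread.sleep(60_000);
            return "controller-id";
        }, 200, TimeUnit.MILLISECONDS, null);

        System.out.println(controller == null
                ? "timed out, skipping controlled shutdown"
                : controller);
    }
}
```

This only bounds the one call it wraps, so it would not help any other
thread that ends up blocked in the same way.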

I filed https://issues.apache.org/jira/browse/KAFKA-1907 to track the issue.


On Mon, Jan 26, 2015 at 6:35 AM, Jaikiran Pai <jai.forums2...@gmail.com>
wrote:

> The main culprit is this thread which goes into "forever retry connection
> to a closed zookeeper" when I shutdown Kafka (via a Ctrl + C) after
> zookeeper has already been shutdown. I have attached the complete thread
> dump, but I don't know if it will be delivered to the mailing list.
>
> "Thread-2" prio=10 tid=0xb3305000 nid=0x4758 waiting on condition
> [0x6ad69000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for  <0x70a93368> (a java.util.concurrent.locks.
> AbstractQueuedSynchronizer$ConditionObject)
>     at java.util.concurrent.locks.LockSupport.parkUntil(
> LockSupport.java:267)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$
> ConditionObject.awaitUntil(AbstractQueuedSynchronizer.java:2130)
>     at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:636)
>     at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:619)
>     at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:615)
>     at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:679)
>     at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:766)
>     at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761)
>     at kafka.utils.ZkUtils$.readDataMaybeNull(ZkUtils.scala:456)
>     at kafka.utils.ZkUtils$.getController(ZkUtils.scala:65)
>     at kafka.server.KafkaServer.kafka$server$KafkaServer$$
> controlledShutdown(KafkaServer.scala:194)
>     at kafka.server.KafkaServer$$anonfun$shutdown$1.apply$mcV$
> sp(KafkaServer.scala:269)
>     at kafka.utils.Utils$.swallow(Utils.scala:172)
>     at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
>     at kafka.utils.Utils$.swallowWarn(Utils.scala:45)
>     at kafka.utils.Logging$class.swallow(Logging.scala:94)
>     at kafka.utils.Utils$.swallow(Utils.scala:45)
>     at kafka.server.KafkaServer.shutdown(KafkaServer.scala:269)
>     at kafka.server.KafkaServerStartable.shutdown(
> KafkaServerStartable.scala:42)
>     at kafka.Kafka$$anon$1.run(Kafka.scala:42)
>
> -Jaikiran
>
>
> On Monday 26 January 2015 05:46 AM, Neha Narkhede wrote:
>
>> For a clean shutdown, the broker tries to talk to the controller and also
>> issues reads to zookeeper. Possibly that is where it tries to reconnect to
>> zk. It will help to look at the thread dump.
>>
>> Thanks
>> Neha
>>
>> On Fri, Jan 23, 2015 at 8:53 PM, Jaikiran Pai <jai.forums2...@gmail.com>
>> wrote:
>>
>>> I was just playing around with the RC2 of 0.8.2 and noticed that if I
>>> shutdown zookeeper first I can't shutdown Kafka server at all since it
>>> goes into a never ending attempt to reconnect with zookeeper. I had to
>>> kill the Kafka process to stop it. I tried it against trunk too and
>>> there too I see the same issue. Should I file a JIRA for this and see
>>> if I can come up with a patch?
>>>
>>> FWIW, here's the unending (and IMO too frequent) attempts at trying to
>>> reconnect. I've a thread dump too which shows that the other thread
>>> which is trying to complete a controlled shutdown of Kafka is blocked
>>> forever for the zookeeper to be up. I can attach it to the JIRA.
>>>
>>> [2015-01-24 10:15:46,278] WARN Session 0x14b1a4136800000 for server null,
>>> unexpected error, closing socket connection and attempting reconnect
>>> (org.apache.zookeeper.ClientCnxn)
>>> java.net.ConnectException: Connection refused
>>>      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>      at sun.nio.ch.SocketChannelImpl.finishConnect(
>>> SocketChannelImpl.java:739)
>>>      at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(
>>> ClientCnxnSocketNIO.java:361)
>>>      at org.apache.zookeeper.ClientCnxn$SendThread.run(
>>> ClientCnxn.java:1081)
>>> [2015-01-24 10:15:47,437] INFO Opening socket connection to server
>>> localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL
>>> (unknown error) (org.apache.zookeeper.ClientCnxn)
>>> [2015-01-24 10:15:47,438] WARN Session 0x14b1a4136800000 for server null,
>>> unexpected error, closing socket connection and attempting reconnect
>>> (org.apache.zookeeper.ClientCnxn)
>>> java.net.ConnectException: Connection refused
>>>      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>      at sun.nio.ch.SocketChannelImpl.finishConnect(
>>> SocketChannelImpl.java:739)
>>>      at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(
>>> ClientCnxnSocketNIO.java:361)
>>>      at org.apache.zookeeper.ClientCnxn$SendThread.run(
>>> ClientCnxn.java:1081)
>>> [2015-01-24 10:15:49,056] INFO Opening socket connection to server
>>> localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL
>>> (unknown error) (org.apache.zookeeper.ClientCnxn)
>>> [2015-01-24 10:15:49,057] WARN Session 0x14b1a4136800000 for server null,
>>> unexpected error, closing socket connection and attempting reconnect
>>> (org.apache.zookeeper.ClientCnxn)
>>> java.net.ConnectException: Connection refused
>>>      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>      at sun.nio.ch.SocketChannelImpl.finishConnect(
>>> SocketChannelImpl.java:739)
>>>      at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(
>>> ClientCnxnSocketNIO.java:361)
>>>      at org.apache.zookeeper.ClientCnxn$SendThread.run(
>>> ClientCnxn.java:1081)
>>> [2015-01-24 10:15:50,801] INFO Opening socket connection to server
>>> localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL
>>> (unknown error) (org.apache.zookeeper.ClientCnxn)
>>> [2015-01-24 10:15:50,802] WARN Session 0x14b1a4136800000 for server null,
>>> unexpected error, closing socket connection and attempting reconnect
>>> (org.apache.zookeeper.ClientCnxn)
>>> java.net.ConnectException: Connection refused
>>>      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>      at sun.nio.ch.SocketChannelImpl.finishConnect(
>>> SocketChannelImpl.java:739)
>>>      at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(
>>> ClientCnxnSocketNIO.java:361)
>>>      at org.apache.zookeeper.ClientCnxn$SendThread.run(
>>> ClientCnxn.java:1081)
>>>
>>>
>>>
>>>
>>> -Jaikiran
>>>
>>>
>>
>>
>


-- 
Thanks,
Ewen