[ https://issues.apache.org/jira/browse/KAFKA-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976597#comment-13976597 ]
Dmitry Bugaychenko commented on KAFKA-1407: ------------------------------------------- Checked out the logs in more details - there were problems in connection with ZK and it looks like they ended up in the deadlock between ZkClient thread and delete-topics threads: {code} [2014-04-22 02:37:40,597] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient) ... [2014-04-22 02:37:42,067] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient) ... [2014-04-22 02:40:18,217] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient) ... [2014-04-22 02:40:42,523] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient) [2014-04-22 02:40:42,525] INFO [SessionExpirationListener on 1], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener) [2014-04-22 02:40:42,527] DEBUG Shutting down task scheduler. (kafka.utils.KafkaScheduler) [2014-04-22 02:40:42,528] ERROR Error handling event ZkEvent[New session event sent to kafka.controller.KafkaController$SessionExpirationListener@6fb84e18] (org.Itec.zkclient.ZkEventThread) ava.lang.IllegalStateException: Kafka scheduler has not been started at kafka.utils.KafkaScheduler.ensureStarted(KafkaScheduler.scala:116) at kafka.utils.KafkaScheduler.shutdown(KafkaScheduler.scala:86) at kafka.controller.KafkaController$$anonfun$onControllerResignation$1.apply$mcV$sp(KafkaController.scala:339) at kafka.controller.KafkaController$$anonfun$onControllerResignation$1.apply(KafkaController.scala:337) at kafka.controller.KafkaController$$anonfun$onControllerResignation$1.apply(KafkaController.scala:337) at kafka.utils.Utils$.inLock(Utils.scala:538) at kafka.controller.KafkaController.onControllerResignation(KafkaController.scala:337) at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply$mcZ$sp(KafkaController.scala:1068) at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaController.scala:1067) at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaController.scala:1067) at kafka.utils.Utils$.inLock(Utils.scala:538) at kafka.controller.KafkaController$SessionExpirationListener.handleNewSession(KafkaController.scala:1067) at org.I0Itec.zkclient.ZkClient$4.run(ZkClient.java:472) at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) [2014-04-22 02:40:42,528] INFO Session establishment complete on server devlnx2.odnoklassniki.ru/172.18.10.101:2181, sessionid = 0x14545879f6c52a6, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn) [2014-04-22 02:40:42,529] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient) [2014-04-22 02:40:42,529] INFO re-registering broker info in ZK for broker 1 (kafka.server.KafkaHealthcheck) [2014-04-22 02:40:42,843] INFO Registered broker 1 at path /brokers/ids/1 with address devlnx5:9092. (kafka.utils.ZkUtils$) [2014-04-22 02:40:42,843] INFO done re-registering broker (kafka.server.KafkaHealthcheck) [2014-04-22 02:40:42,844] INFO Subscribing to /brokers/topics path to watch for new topics (kafka.server.KafkaHealthcheck) [2014-04-22 02:40:44,841] INFO [ControllerEpochListener on 1]: Initialized controller epoch to 53 and zk version 52 (kafka.controller.ControllerEpochListener) [2014-04-22 02:40:46,152] INFO Reloading nodes from /cluster/databus-monitoring-service/servers (one.zookeeper.client.childmon.NodeChildrenMonitor) [2014-04-22 02:40:47,323] INFO 1 successfully elected as leader (kafka.server.ZookeeperLeaderElector) [2014-04-22 02:40:47,324] INFO [Controller 1]: Broker 1 starting become controller state transition (kafka.controller.KafkaController) [2014-04-22 02:40:47,392] INFO Reloading nodes from /cluster/databus-monitoring-service/servers (one.zookeeper.client.childmon.NodeChildrenMonitor) [2014-04-22 02:40:48,389] INFO [Controller 1]: Controller 1 incremented epoch to 54 (kafka.controller.KafkaController) [2014-04-22 02:41:01,041] INFO [Controller 1]: Broker 1 is ready to serve as the new controller with epoch 54 (kafka.controller.KafkaController) [2014-04-22 02:41:01,042] INFO [Controller 1]: Starting preferred replica leader election for partitions (kafka.controller.KafkaController) [2014-04-22 02:41:01,043] INFO [Partition state machine on Controller 1]: Invoking state change to OnlinePartition for partitions (kafka.controller.PartitionStateMachine) [2014-04-22 02:41:01,048] INFO [Controller 1]: starting the partition rebalance scheduler (kafka.controller.KafkaController) [2014-04-22 02:41:01,048] DEBUG Initializing task scheduler. (kafka.utils.KafkaScheduler) [2014-04-22 02:41:01,049] DEBUG Scheduling task partition-rebalance-thread with initial delay 5000 ms and period 300000 ms. (kafka.utils.KafkaScheduler) [2014-04-22 02:41:01,050] INFO [delete-topics-thread], Starting (kafka.controller.TopicDeletionManager$DeleteTopicsThread) [2014-04-22 02:41:01,053] INFO New leader is 1 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) [2014-04-22 02:41:01,054] INFO Waiting for signal to start or continue topic deletion (kafka.controller.TopicDeletionManager) [2014-04-22 02:41:01,055] INFO [ControllerEpochListener on 1]: Initialized controller epoch to 54 and zk version 53 (kafka.controller.ControllerEpochListener) [2014-04-22 02:41:01,283] INFO [ReplicaFetcherThread-0-2], Stopped (kafka.server.ReplicaFetcherThread) [2014-04-22 02:41:01,283] INFO [ReplicaFetcherThread-0-2], Shutdown completed (kafka.server.ReplicaFetcherThread) ... [2014-04-22 02:48:33,567] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient) [2014-04-22 02:48:34,812] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient) [2014-04-22 02:48:40,380] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient) [2014-04-22 02:48:49,008] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient) [2014-04-22 02:48:49,010] INFO [SessionExpirationListener on 1], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener) [2014-04-22 02:48:49,010] DEBUG Shutting down task scheduler. (kafka.utils.KafkaScheduler) [2014-04-22 02:48:49,011] INFO [delete-topics-thread], Shutting down (kafka.controller.TopicDeletionManager$DeleteTopicsThread) ... [2014-04-22 02:48:51,079] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient) ... [2014-04-22 02:49:19,988] INFO Partition [metrics,6] on broker 1: Shrinking ISR for partition [metrics,6] from 1,3 to 1 (kafka.cluster.Partition) [2014-04-22 02:49:19,990] ERROR Conditional update of path /brokers/topics/metrics/partitions/6/state with data {"controller_epoch":52,"leader":1,"version":1,"leader_epoch":46,"isr":[1]} and expected version 80 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/metrics/partitions/6/state (kafka.utils.ZkUtils$) [2014-04-22 02:49:19,991] INFO Partition [metrics,6] on broker 1: Cached zkVersion [80] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition) {code} {code} "ZkClient-EventThread-37-devlnx2:2181" daemon prio=10 tid=0x00007f6c78590000 nid=0x40b2 waiting on condition [ java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000000071fbcfae8> (a java.util.concurrent.CountDownLatch$Sync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronize at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSync at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchr at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236) at kafka.utils.ShutdownableThread.shutdown(ShutdownableThread.scala:36) at kafka.controller.TopicDeletionManager.shutdown(TopicDeletionManager.scala:93) at kafka.controller.KafkaController$$anonfun$onControllerResignation$1.apply$mcV$sp(KafkaController.scal at kafka.controller.KafkaController$$anonfun$onControllerResignation$1.apply(KafkaController.scala:337) at kafka.controller.KafkaController$$anonfun$onControllerResignation$1.apply(KafkaController.scala:337) at kafka.utils.Utils$.inLock(Utils.scala:538) at kafka.controller.KafkaController.onControllerResignation(KafkaController.scala:337) at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply$mcZ$sp(K at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaCon at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaCon at kafka.utils.Utils$.inLock(Utils.scala:538) at kafka.controller.KafkaController$SessionExpirationListener.handleNewSession(KafkaController.scala:106 at org.I0Itec.zkclient.ZkClient$4.run(ZkClient.java:472) at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) "delete-topics-thread" prio=10 tid=0x00007f6c2801d000 nid=0x61f1 waiting on condition [0x00007f6cbf2f1000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000000071f594c08> (a java.util.concurrent.locks.ReentrantLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronize at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:8 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronize at kafka.controller.TopicDeletionManager.kafka$controller$TopicDeletionManager$$awaitTopicDeletionNotifi at kafka.controller.TopicDeletionManager$DeleteTopicsThread$$anonfun$doWork$1.apply$mcV$sp(TopicDeletion at kafka.controller.TopicDeletionManager$DeleteTopicsThread$$anonfun$doWork$1.apply(TopicDeletionManager at kafka.controller.TopicDeletionManager$DeleteTopicsThread$$anonfun$doWork$1.apply(TopicDeletionManager at kafka.utils.Utils$.inLock(Utils.scala:538) at kafka.controller.TopicDeletionManager$DeleteTopicsThread.doWork(TopicDeletionManager.scala:333) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51) {code} We are using 0.8.1 and I tried to find an option to disbale topic deletion, but didn't manage... > Broker can not return to ISR because of BadVersionException > ----------------------------------------------------------- > > Key: KAFKA-1407 > URL: https://issues.apache.org/jira/browse/KAFKA-1407 > Project: Kafka > Issue Type: Bug > Components: controller > Affects Versions: 0.8.1 > Reporter: Dmitry Bugaychenko > Assignee: Neha Narkhede > > Each morning we found a broker out of ISR at stuck with log full of messages: > {code} > INFO | jvm 1 | 2014/04/21 08:36:21 | [2014-04-21 09:36:21,907] ERROR > Conditional update of path /brokers/topics/topic2/partitions/1/state with > data > {"controller_epoch":46,"leader":2,"version":1,"leader_epoch":38,"isr":[2]} > and expected version 53 failed due to > org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = > BadVersion for /brokers/topics/topic2/partitions/1/state > (kafka.utils.ZkUtils$) > INFO | jvm 1 | 2014/04/21 08:36:21 | [2014-04-21 09:36:21,907] INFO > Partition [topic2,1] on broker 2: Cached zkVersion [53] not equal to that in > zookeeper, skip updating ISR (kafka.cluster.Partition) > {code} > It seems that it can not recover after short netwrok break down and the only > way to return it is restart it using kill -9. -- This message was sent by Atlassian JIRA (v6.2#6252)