[ https://issues.apache.org/jira/browse/KAFKA-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974420#comment-13974420 ]
Jay Kreps commented on KAFKA-1398:
----------------------------------

Updated reviewboard https://reviews.apache.org/r/20471/ against branch trunk

> Topic config changes can be lost and cause fatal exceptions on broker restarts
> ------------------------------------------------------------------------------
>
> Key: KAFKA-1398
> URL: https://issues.apache.org/jira/browse/KAFKA-1398
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.1
> Reporter: Joel Koshy
> Assignee: Jay Kreps
> Priority: Critical
> Fix For: 0.8.1.1
>
> Attachments: KAFKA-1398.patch, KAFKA-1398.patch, KAFKA-1398.patch, KAFKA-1398_2014-04-18_13:03:03.patch
>
>
> Our topic config cleanup policy seems to be broken. When a broker is
> bounced and starting up:
> 1 - Read all the children of the config change path
> 2 - For each, if the change id is greater than the last executed change,
>     then extract the topic information.
> 3 - If there is a log for that topic on this broker, then apply the change.
>     However, if there is no log, then delete the config change.
> In step 3, a delete triggers a child change watch firing on all the other
> brokers. The other brokers currently take all the children of the config
> path but will ignore those config changes that are less than the last
> executed change. At least one issue here is that if a broker does not have
> partitions for a topic then the lastExecutedChange is not updated (for
> that topic).
> Consider this scenario:
> - Three brokers 0, 1, 2
> - Topic A has partitions only assigned to broker 0
> - Topic B has partitions only assigned to broker 1
> - Topic C has partitions only assigned to broker 2
> - Change 0: topic A
> - Change 1: topic B
> - Change 2: topic C
> - lastExecutedChange on broker 0 is 0
> - lastExecutedChange on broker 1 is 1
> - lastExecutedChange on broker 2 is 2
> - Bounce broker 1
> - The above bounce will cause Change 0 and Change 2 to get deleted.
> - Watch fires on broker 0 and 1
> - Broker 0 will try and read the topic corresponding to change 1 (since its
>   lastExecutedChange is 0) and then change 2. That read will fail:
> 2014/04/15 19:35:34.236 INFO [TopicConfigManager] [main] [kafka-server] [] Processed topic config change 25 for topic xyz, setting new config to {retention.ms=3600000, segment.ms=3600000}.
> 2014/04/15 19:35:34.238 FATAL [KafkaServerStartable] [main] [kafka-server] [] Fatal error during KafkaServerStable startup. Prepare to shutdown
> org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /config/changes/config_change_0000000026
>     at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)
>     at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685)
>     at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:766)
>     at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761)
>     at kafka.utils.ZkUtils$.readData(ZkUtils.scala:467)
>     at kafka.server.TopicConfigManager$$anonfun$kafka$server$TopicConfigManager$$processConfigChanges$2.apply(TopicConfigManager.scala:97)
>     at kafka.server.TopicConfigManager$$anonfun$kafka$server$TopicConfigManager$$processConfigChanges$2.apply(TopicConfigManager.scala:93)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:57)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:43)
>     at kafka.server.TopicConfigManager.kafka$server$TopicConfigManager$$processConfigChanges(TopicConfigManager.scala:93)
>     at kafka.server.TopicConfigManager.processAllConfigChanges(TopicConfigManager.scala:81)
>     at kafka.server.TopicConfigManager.startup(TopicConfigManager.scala:72)
>     at kafka.server.KafkaServer.startup(KafkaServer.scala:104)
>     at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:34)
>     ...
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /config/changes/config_change_0000000026
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:956)
>     at org.I0Itec.zkclient.ZkConnection.readData(ZkConnection.java:103)
>     at org.I0Itec.zkclient.ZkClient$9.call(ZkClient.java:770)
>     at org.I0Itec.zkclient.ZkClient$9.call(ZkClient.java:766)
>     at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
>     ... 39 more
> Another issue is that there are two logging statements with incorrect
> qualifiers which makes things a little harder to debug. E.g.,
> 2014/04/15 19:35:34.223 ERROR [TopicConfigManager] [kafka-server] [] Ignoring topic config change %d for topic %s since the change has expired

--
This message was sent by Atlassian JIRA (v6.2#6252)
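
The startup pass in steps 1 to 3 of the quoted description, and the three-broker scenario that follows it, can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions, not the Kafka code: `Broker`, `NoNodeError`, `process_config_changes` and `run_scenario` are hypothetical names, a plain dict stands in for the /config/changes path in ZooKeeper, and the interleaving (broker 0 working from a child list fetched before broker 1 finished deleting) is assumed rather than taken from the report.

```python
class NoNodeError(Exception):
    """Hypothetical stand-in for ZooKeeper's NoNodeException."""


class Broker:
    def __init__(self, broker_id, local_topics, last_executed=-1):
        self.broker_id = broker_id
        self.local_topics = set(local_topics)
        self.last_executed = last_executed

    def process_config_changes(self, store, children):
        """Apply steps 2 and 3 to a previously fetched list of change ids."""
        for change_id in sorted(children):
            if change_id <= self.last_executed:
                continue  # already executed on this broker
            if change_id not in store:
                # The node was deleted between listing and reading.
                raise NoNodeError("config_change_%010d" % change_id)
            topic = store[change_id]
            if topic in self.local_topics:
                self.last_executed = change_id  # log exists: apply the change
            else:
                del store[change_id]  # no log here: delete the change


def run_scenario():
    store = {0: "A", 1: "B", 2: "C"}  # change id -> topic
    broker0 = Broker(0, ["A"], last_executed=0)
    broker1 = Broker(1, ["B"])  # bounced, so it starts with no executed change

    # Assumed interleaving: broker 0 holds a child list taken before
    # broker 1 completes its startup pass.
    stale_children = list(store)
    broker1.process_config_changes(store, list(store))
    remaining = sorted(store)  # changes 0 and 2 are gone at this point

    try:
        broker0.process_config_changes(store, stale_children)
        failure = None
    except NoNodeError as exc:
        failure = str(exc)
    return remaining, failure
```

In this sketch broker 1's restart removes changes 0 and 2, and broker 0 then fails when it tries to read change 2 because its own lastExecutedChange never advanced past 0. Treating a missing node as "already handled" instead of raising would be one way to make the read tolerant, though that is a design assumption here and not a description of the attached patch.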