[ https://issues.apache.org/jira/browse/KAFKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15532791#comment-15532791 ]
Ismael Juma commented on KAFKA-4231: ------------------------------------ I'm not aware of any reason why 0.10.0.1 should affect what's stored in ZK. > ZK metadata went inconsistent when migrating 0.10.0.0 -> 0.10.0.1 > ----------------------------------------------------------------- > > Key: KAFKA-4231 > URL: https://issues.apache.org/jira/browse/KAFKA-4231 > Project: Kafka > Issue Type: Bug > Affects Versions: 0.10.0.1 > Environment: 3-node cluster on AWS, 700+ topics, created and dropped > on regular basis, we use Java Kafka client 0.9.0.1 and ZKUtils 0.9.0.1 to > create/drop topics automatically > Reporter: Sergey Alaev > > 1. we updated our Kafka cluster 0.10.0.0 -> 0.10.0.1 6 days ago > 2. yesterday, we got huge amount of server logs looking like: > [2016-09-29 12:33:27,055] ERROR [ReplicaFetcherThread-0-1], Error for > partition > [test-customer-recipient-4115239346516610_test-recipient-4115239346516610,0] > to broker 1:org.apache.kafka.common.errors > .UnknownTopicOrPartitionException: This server does not host this > topic-partition. (kafka.server.ReplicaFetcherThread) > 3. Today I've restarted all three nodes and they failed to start throwing > exception featured below. > Investigation showed that ZK: > /config/topics contained 729 entries > /admin/delete_topics contained 484 entries, all of them present in > /config/topics > /brokers/topics were missing 6 entries present in /config/topics! This was > the cause of startup failure > Removing those 6 entries from /config/topics fixed this issue. > I'm sure that we didn't change ZK data manually, the only possible culpits > are kafka-server 0.10.0.1 and ZKUtils 0.9.0.1 (used only to delete topics) > [2016-09-29 12:52:11,082] ERROR Error while electing or becoming leader on > broker 2 (kafka.server.ZookeeperLeaderElector) > kafka.common.KafkaException: Can't parse json string: null > at kafka.utils.Json$.liftedTree1$1(Json.scala:40) > at kafka.utils.Json$.parseFull(Json.scala:36) > at > kafka.utils.ZkUtils$$anonfun$getReplicaAssignmentForTopics$1.apply(ZkUtils.scala:610) > at > kafka.utils.ZkUtils$$anonfun$getReplicaAssignmentForTopics$1.apply(ZkUtils.scala:606) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > kafka.utils.ZkUtils.getReplicaAssignmentForTopics(ZkUtils.scala:606) > at > kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:744) > at > kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:333) > at > kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:166) > at > kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:231) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141) > at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824) > at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) > Caused by: java.lang.NullPointerException > Server configuration: > [2016-09-29 12:40:58,190] INFO KafkaConfig values: > advertised.host.name = null > metric.reporters = [] > quota.producer.default = 9223372036854775807 > offsets.topic.num.partitions = 50 > log.flush.interval.messages = 9223372036854775807 > auto.create.topics.enable = true > controller.socket.timeout.ms = 30000 > log.flush.interval.ms = null > principal.builder.class = class > org.apache.kafka.common.security.auth.DefaultPrincipalBuilder > replica.socket.receive.buffer.bytes = 65536 > min.insync.replicas = 2 > replica.fetch.wait.max.ms = 500 > num.recovery.threads.per.data.dir = 1 > ssl.keystore.type = JKS > sasl.mechanism.inter.broker.protocol = GSSAPI > default.replication.factor = 3 > ssl.truststore.password = [hidden] > log.preallocate = false > sasl.kerberos.principal.to.local.rules = [DEFAULT] > fetch.purgatory.purge.interval.requests = 1000 > ssl.endpoint.identification.algorithm = null > replica.socket.timeout.ms = 30000 > message.max.bytes = 1000012 > num.io.threads = 10 > offsets.commit.required.acks = -1 > log.flush.offset.checkpoint.interval.ms = 60000 > delete.topic.enable = true > quota.window.size.seconds = 1 > ssl.truststore.type = JKS > offsets.commit.timeout.ms = 5000 > quota.window.num = 11 > zookeeper.connect = > 10.148.27.151:2181,10.148.27.152:2181,10.148.27.153:2181 > authorizer.class.name = > num.replica.fetchers = 1 > log.retention.ms = null > log.roll.jitter.hours = 0 > log.cleaner.enable = true > offsets.load.buffer.size = 5242880 > log.cleaner.delete.retention.ms = 86400000 > ssl.client.auth = none > controlled.shutdown.max.retries = 3 > queued.max.requests = 500 > offsets.topic.replication.factor = 3 > log.cleaner.threads = 1 > sasl.kerberos.service.name = null > sasl.kerberos.ticket.renew.jitter = 0.05 > socket.request.max.bytes = 104857600 > ssl.trustmanager.algorithm = PKIX > zookeeper.session.timeout.ms = 6000 > log.retention.bytes = -1 > log.message.timestamp.type = CreateTime > sasl.kerberos.min.time.before.relogin = 60000 > zookeeper.set.acl = false > connections.max.idle.ms = 600000 > offsets.retention.minutes = 1440 > replica.fetch.backoff.ms = 1000 > inter.broker.protocol.version = 0.10.0-IV1 > log.retention.hours = 24 > num.partitions = 1 > broker.id.generation.enable = true > listeners = PLAINTEXT://:9092,SSL://:9093 > ssl.provider = null > ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1] > log.roll.ms = null > log.flush.scheduler.interval.ms = 9223372036854775807 > ssl.cipher.suites = null > log.index.size.max.bytes = 10485760 > ssl.keymanager.algorithm = SunX509 > security.inter.broker.protocol = PLAINTEXT > replica.fetch.max.bytes = 1048576 > advertised.port = null > log.cleaner.dedupe.buffer.size = 134217728 > replica.high.watermark.checkpoint.interval.ms = 5000 > log.cleaner.io.buffer.size = 524288 > sasl.kerberos.ticket.renew.window.factor = 0.8 > zookeeper.connection.timeout.ms = 60000 > controlled.shutdown.retry.backoff.ms = 5000 > log.roll.hours = 168 > log.cleanup.policy = delete > host.name = > log.roll.jitter.ms = null > max.connections.per.ip = 2147483647 > offsets.topic.segment.bytes = 104857600 > background.threads = 10 > quota.consumer.default = 9223372036854775807 > request.timeout.ms = 30000 > log.message.format.version = 0.10.0-IV1 > log.index.interval.bytes = 4096 > log.dir = /tmp/kafka-logs > log.segment.bytes = 1073741824 > log.cleaner.backoff.ms = 15000 > offset.metadata.max.bytes = 4096 > ssl.truststore.location = /ssl/server.truststore.jks > group.max.session.timeout.ms = 300000 > ssl.keystore.password = [hidden] > zookeeper.sync.time.ms = 2000 > port = 9092 > log.retention.minutes = null > log.segment.delete.delay.ms = 60000 > log.dirs = /data > controlled.shutdown.enable = true > compression.type = producer > max.connections.per.ip.overrides = > log.message.timestamp.difference.max.ms = 9223372036854775807 > sasl.kerberos.kinit.cmd = /usr/bin/kinit > log.cleaner.io.max.bytes.per.second = 1.7976931348623157E308 > auto.leader.rebalance.enable = true > leader.imbalance.check.interval.seconds = 300 > log.cleaner.min.cleanable.ratio = 0.5 > replica.lag.time.max.ms = 10000 > num.network.threads = 5 > ssl.key.password = [hidden] > reserved.broker.max.id = 1000 > metrics.num.samples = 2 > socket.send.buffer.bytes = 102400 > ssl.protocol = TLS > socket.receive.buffer.bytes = 102400 > ssl.keystore.location = /ssl/server.keystore.jks > replica.fetch.min.bytes = 1 > broker.rack = null > unclean.leader.election.enable = true > sasl.enabled.mechanisms = [GSSAPI] > group.min.session.timeout.ms = 6000 > log.cleaner.io.buffer.load.factor = 0.9 > offsets.retention.check.interval.ms = 600000 > producer.purgatory.purge.interval.requests = 1000 > metrics.sample.window.ms = 30000 > broker.id = 3 > offsets.topic.compression.codec = 0 > log.retention.check.interval.ms = 300000 > advertised.listeners = PLAINTEXT://:9092,SSL://:9093 > leader.imbalance.per.broker.percentage = 10 > (kafka.server.KafkaConfig) -- This message was sent by Atlassian JIRA (v6.3.4#6332)