Sergey Alaev created KAFKA-4231:
-----------------------------------

             Summary: ZK metadata went inconsistent when migrating 0.10.0.0 -> 
0.10.0.1
                 Key: KAFKA-4231
                 URL: https://issues.apache.org/jira/browse/KAFKA-4231
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.10.0.1
         Environment: 3-node cluster on AWS, 700+ topics, created and dropped 
on regular basis, we use Java Kafka client 0.9.0.1 and ZKUtils 0.9.0.1 to 
create/drop topics automatically
            Reporter: Sergey Alaev


1. we updated our Kafka cluster 0.10.0.0 -> 0.10.0.1 6 days ago
2. yesterday, we got huge amount of server logs looking like:
[2016-09-29 12:33:27,055] ERROR [ReplicaFetcherThread-0-1], Error for partition 
[test-customer-recipient-4115239346516610_test-recipient-4115239346516610,0] to 
broker 1:org.apache.kafka.common.errors
.UnknownTopicOrPartitionException: This server does not host this 
topic-partition. (kafka.server.ReplicaFetcherThread)
3. Today I've restarted all three nodes and they failed to start throwing 
exception featured below.

Investigation showed that ZK:
 /config/topics contained 729 entries
/admin/delete_topics contained 484 entries, all of them present in 
/config/topics
/brokers/topics were missing 6 entries present in /config/topics! This was the 
cause of startup failure

Removing those 6 entries from /config/topics fixed this issue.

I'm sure that we didn't change ZK data manually, the only possible culpits are 
kafka-server 0.10.0.1 and ZKUtils 0.9.0.1 (used only to delete topics)


[2016-09-29 12:52:11,082] ERROR Error while electing or becoming leader on 
broker 2 (kafka.server.ZookeeperLeaderElector)
kafka.common.KafkaException: Can't parse json string: null
        at kafka.utils.Json$.liftedTree1$1(Json.scala:40)
        at kafka.utils.Json$.parseFull(Json.scala:36)
        at 
kafka.utils.ZkUtils$$anonfun$getReplicaAssignmentForTopics$1.apply(ZkUtils.scala:610)
        at 
kafka.utils.ZkUtils$$anonfun$getReplicaAssignmentForTopics$1.apply(ZkUtils.scala:606)
        at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at kafka.utils.ZkUtils.getReplicaAssignmentForTopics(ZkUtils.scala:606)
        at 
kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:744)
        at 
kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:333)
        at 
kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:166)
        at 
kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
        at 
kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
        at 
kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
        at 
kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:231)
        at 
kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
        at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
        at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
Caused by: java.lang.NullPointerException

Server configuration:
[2016-09-29 12:40:58,190] INFO KafkaConfig values:
        advertised.host.name = null
        metric.reporters = []
        quota.producer.default = 9223372036854775807
        offsets.topic.num.partitions = 50
        log.flush.interval.messages = 9223372036854775807
        auto.create.topics.enable = true
        controller.socket.timeout.ms = 30000
        log.flush.interval.ms = null
        principal.builder.class = class 
org.apache.kafka.common.security.auth.DefaultPrincipalBuilder
        replica.socket.receive.buffer.bytes = 65536
        min.insync.replicas = 2
        replica.fetch.wait.max.ms = 500
        num.recovery.threads.per.data.dir = 1
        ssl.keystore.type = JKS
        sasl.mechanism.inter.broker.protocol = GSSAPI
        default.replication.factor = 3
        ssl.truststore.password = [hidden]
        log.preallocate = false
        sasl.kerberos.principal.to.local.rules = [DEFAULT]
        fetch.purgatory.purge.interval.requests = 1000
        ssl.endpoint.identification.algorithm = null
        replica.socket.timeout.ms = 30000
        message.max.bytes = 1000012
        num.io.threads = 10
        offsets.commit.required.acks = -1
        log.flush.offset.checkpoint.interval.ms = 60000
        delete.topic.enable = true
        quota.window.size.seconds = 1
        ssl.truststore.type = JKS
        offsets.commit.timeout.ms = 5000
        quota.window.num = 11
        zookeeper.connect = 
10.148.27.151:2181,10.148.27.152:2181,10.148.27.153:2181
        authorizer.class.name =
        num.replica.fetchers = 1
        log.retention.ms = null
        log.roll.jitter.hours = 0
        log.cleaner.enable = true
        offsets.load.buffer.size = 5242880
        log.cleaner.delete.retention.ms = 86400000
        ssl.client.auth = none
        controlled.shutdown.max.retries = 3
        queued.max.requests = 500
        offsets.topic.replication.factor = 3
        log.cleaner.threads = 1
        sasl.kerberos.service.name = null
        sasl.kerberos.ticket.renew.jitter = 0.05
        socket.request.max.bytes = 104857600
        ssl.trustmanager.algorithm = PKIX
        zookeeper.session.timeout.ms = 6000
        log.retention.bytes = -1
        log.message.timestamp.type = CreateTime
        sasl.kerberos.min.time.before.relogin = 60000
        zookeeper.set.acl = false
        connections.max.idle.ms = 600000
        offsets.retention.minutes = 1440
        replica.fetch.backoff.ms = 1000
        inter.broker.protocol.version = 0.10.0-IV1
        log.retention.hours = 24
        num.partitions = 1
        broker.id.generation.enable = true
        listeners = PLAINTEXT://:9092,SSL://:9093
        ssl.provider = null
        ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
        log.roll.ms = null
        log.flush.scheduler.interval.ms = 9223372036854775807
        ssl.cipher.suites = null
        log.index.size.max.bytes = 10485760
        ssl.keymanager.algorithm = SunX509
        security.inter.broker.protocol = PLAINTEXT
        replica.fetch.max.bytes = 1048576
        advertised.port = null
        log.cleaner.dedupe.buffer.size = 134217728
        replica.high.watermark.checkpoint.interval.ms = 5000
        log.cleaner.io.buffer.size = 524288
        sasl.kerberos.ticket.renew.window.factor = 0.8
        zookeeper.connection.timeout.ms = 60000
        controlled.shutdown.retry.backoff.ms = 5000
        log.roll.hours = 168
        log.cleanup.policy = delete
        host.name =
        log.roll.jitter.ms = null
        max.connections.per.ip = 2147483647
        offsets.topic.segment.bytes = 104857600
        background.threads = 10
        quota.consumer.default = 9223372036854775807
        request.timeout.ms = 30000
        log.message.format.version = 0.10.0-IV1
        log.index.interval.bytes = 4096
        log.dir = /tmp/kafka-logs
        log.segment.bytes = 1073741824
        log.cleaner.backoff.ms = 15000
        offset.metadata.max.bytes = 4096
        ssl.truststore.location = /ssl/server.truststore.jks
        group.max.session.timeout.ms = 300000
        ssl.keystore.password = [hidden]
        zookeeper.sync.time.ms = 2000
        port = 9092
        log.retention.minutes = null
        log.segment.delete.delay.ms = 60000
        log.dirs = /data
        controlled.shutdown.enable = true
        compression.type = producer
        max.connections.per.ip.overrides =
        log.message.timestamp.difference.max.ms = 9223372036854775807
        sasl.kerberos.kinit.cmd = /usr/bin/kinit
        log.cleaner.io.max.bytes.per.second = 1.7976931348623157E308
        auto.leader.rebalance.enable = true
        leader.imbalance.check.interval.seconds = 300
        log.cleaner.min.cleanable.ratio = 0.5
        replica.lag.time.max.ms = 10000
        num.network.threads = 5
        ssl.key.password = [hidden]
        reserved.broker.max.id = 1000
        metrics.num.samples = 2
        socket.send.buffer.bytes = 102400
        ssl.protocol = TLS
        socket.receive.buffer.bytes = 102400
        ssl.keystore.location = /ssl/server.keystore.jks
        replica.fetch.min.bytes = 1
        broker.rack = null
        unclean.leader.election.enable = true
        sasl.enabled.mechanisms = [GSSAPI]
        group.min.session.timeout.ms = 6000
        log.cleaner.io.buffer.load.factor = 0.9
        offsets.retention.check.interval.ms = 600000
        producer.purgatory.purge.interval.requests = 1000
        metrics.sample.window.ms = 30000
        broker.id = 3
        offsets.topic.compression.codec = 0
        log.retention.check.interval.ms = 300000
        advertised.listeners = PLAINTEXT://:9092,SSL://:9093
        leader.imbalance.per.broker.percentage = 10
 (kafka.server.KafkaConfig)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to