Fwd: Producer acks=1, clean broker shutdown and data loss
I didn't get any hits on this on users@, so I'm forwarding this to dev@. Any explanation for this would be greatly appreciated!

-- Forwarded message ------
From: Nick Travers
Date: Sat, Feb 18, 2017 at 5:04 PM
Subject: Producer acks=1, clean broker shutdown and data loss
To: us...@kafka.apache.org

Hi - I'm trying to understand the expected behavior of the scenario in which I have a producer with `acks=1` (i.e. partition-leader acks only) and I cleanly shut down a broker (via `KafkaServer#shutdown`).

I am running my test scenario with three brokers (0.10.1.1), with a default replication factor and a default partition count of three. The producer (also 0.10.1.1) is writing to a single topic.

When I shut down one broker, I observe that in some instances there is data loss on the topic the producer was sending to, around the instant the broker was shut down and the producer sees that a new leader for one of the partitions has been elected.

Reading the documentation [0], I see the following for the `acks=1` config:

> This will mean the leader will write the record to its local log but will respond without awaiting full acknowledgement from all followers. In this case should the leader fail immediately after acknowledging the record but before the followers have replicated it then the record will be lost.

While it is clear to me that in the case of a _failure_ of this broker messages will be lost (as they have not yet been replicated to a follower), it isn't clear to me what the expected behavior is when the broker is told to _cleanly_ shut down.

While I understand that a setup with `replicas=3`, `min.insync.replicas=2` and `acks=-1` is much safer (and I've verified it works as expected for the same test scenario), I'd like to understand the semantics of the `acks=1` case nonetheless.

Thanks in advance.
- nick

[0]: https://github.com/apache/kafka/blob/0.10.1.1/clients/src/main/java/org/apache/kafka/clients/producer/ProducerConfig.java#L86
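For reference, a minimal sketch of the producer setup being described; the broker addresses and topic name below are placeholders, not taken from the original test:

{code}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker list for the three-broker test cluster.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Leader-only acknowledgement: the send() future completes once the
        // partition leader has written the record, before followers replicate it.
        props.put(ProducerConfig.ACKS_CONFIG, "1");

        // The safer variant discussed above would instead use:
        //   props.put(ProducerConfig.ACKS_CONFIG, "all");  // equivalent to acks=-1
        // together with min.insync.replicas=2 on the topic/broker side.

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value"));
        }
    }
}
{code}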
[jira] [Created] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics
Nick Travers created KAFKA-4900:
---
Summary: Brokers stuck in controller re-election loop after failing to register metrics
Key: KAFKA-4900
URL: https://issues.apache.org/jira/browse/KAFKA-4900
Project: Kafka
Issue Type: Bug
Components: controller, core
Affects Versions: 0.10.1.1
Reporter: Nick Travers

We hit this today in one of our three-node staging clusters. The exception continues to occur on all three nodes.

{code}
2017-03-15 02:17:30,677 ERROR [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] server.ZookeeperLeaderElector - Error while electing or becoming leader on broker 9
java.lang.IllegalArgumentException: A metric named 'MetricName [name=connection-close-rate, group=controller-channel-metrics,description=Connections closed per second in the window., tags={broker-id=10}]' already exists, can't register another one.
    at org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
    at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
    at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
    at org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
    at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
    at kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
    at kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
    at kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
    at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
    at kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
    at kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
    at kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
    at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
    at kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
    at kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
    at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
    at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
    at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
    at kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
    at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
    at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
{code}

We observe a tight loop of controller (re-)election, i.e. one node hits this exception, leadership transitions to the next node, which then hits the exception, ad infinitum.

Producers and consumers appear to be connecting OK, and are able to produce and consume messages.
Relevant data points:
- prior to this cluster restart, a repartition of a number of topics was attempted, which appeared to get stuck in the "in progress" state (on the order of days)
- these topics were subsequently deleted
- a rolling restart of the cluster was performed to turn on broker-to-broker SSL communication
- the SSL change has subsequently been _rolled back_ after we observed these exceptions
- the entire cluster was shut down, and nodes brought back one at a time in an attempt to clear the exception. We were able to restart the cluster, but we continue to see the exceptions

We also observed, during the same time as the exception above, the following exception on all hosts:

{code}
2017-03-15 01:44:04,572 ERROR [ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa] controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener on Controller 10]: Error while handling broker changes
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
    at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
    at kafka.controller.KafkaController$$anonfun$8$$anonfun$apply$2.apply(KafkaController.scala:436)
    at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
    at scala.collection.immutable.List.exists(List.scala:84)
    at kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:436)
    at kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:435)
    at scala.c
[jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics
[ https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925560#comment-15925560 ] Nick Travers commented on KAFKA-4900: - [~onurkaraman] - good to know! Were you able to resolve the issue? Our cluster is up, but it's still in a pretty tight loop. > Brokers stuck in controller re-election loop after failing to register metrics > -- > > Key: KAFKA-4900 > URL: https://issues.apache.org/jira/browse/KAFKA-4900 > Project: Kafka > Issue Type: Bug > Components: controller, core >Affects Versions: 0.10.1.1 >Reporter: Nick Travers > > We hit this today in one of out three node staging clusters. The exception > continues to occur on all three nodes. > {code} > 2017-03-15 02:17:30,677 ERROR > [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] > server.ZookeeperLeaderElector - Error while electing or becoming leader on > broker 9 > java.lang.IllegalArgumentException: A metric named 'MetricName > [name=connection-close-rate, > group=controller-channel-metrics,description=Connections closed per second in > the window., tags={broker-id=10}]' already exists, can't register another one. > at > org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380) > at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179) > at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164) > at > org.apache.kafka.common.network.Selector$SelectorMetrics.(Selector.java:617) > at org.apache.kafka.common.network.Selector.(Selector.java:138) > at > kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101) > at > kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45) > at > kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45) > at scala.collection.immutable.Set$Set3.foreach(Set.scala:163) > at > kafka.controller.ControllerChannelManager.(ControllerChannelManager.scala:45) > at > kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814) > at > kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742) > at > kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334) > at > kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167) > at > kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141) > at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824) > at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) > {code} > We observe a tight loop of controller (re-)election, i.e. one nodes hits this > exception, and leadership transitions to the next, which then hits the > exception, ad infinitum. > Producers and consumers appear to be connecting ok, and are able to produce > and consume messages. 
> Relevant data points: > - prior to this cluster restart, a repartition of a number of topics was > attempted, which appeared to get stuck in the "in progress" state (on the > order of days) > - these topics were subsequently deleted > - a rolling restart of the cluster was performed was to turn on > broker-to-broker SSL communication > - the SSL change has subsequently been _rolled back_ after we observed these > exceptions > - the entire cluster was shut down, and nodes brought back one at a time in > an attempt to clear the exception. We were able to restart the cluster, but > we continue to see the exceptions > We also observed, during the same time as the exception above, the following > exception on all hosts: > {code} > 2017-03-15 01:44:04,572 ERROR > [ZkClient-EventThread-36-sam
[jira] [Updated] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics
[ https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Travers updated KAFKA-4900:
Description:

We hit this today in one of our three-node staging clusters. The exception continues to occur on all three nodes.

{code}
2017-03-15 02:17:30,677 ERROR [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] server.ZookeeperLeaderElector - Error while electing or becoming leader on broker 9
java.lang.IllegalArgumentException: A metric named 'MetricName [name=connection-close-rate, group=controller-channel-metrics,description=Connections closed per second in the window., tags={broker-id=10}]' already exists, can't register another one.
    at org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
    at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
    at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
    at org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
    at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
    at kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
    at kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
    at kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
    at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
    at kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
    at kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
    at kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
    at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
    at kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
    at kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
    at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
    at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
    at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
    at kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
    at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
    at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
{code}

We observe a tight loop of controller (re-)election, i.e. one node hits this exception, and leadership transitions to the next, which then hits the exception, ad infinitum.

Producers and consumers appear to be connecting ok, and are able to produce and consume messages.

Relevant data points:
- prior to this cluster restart, a partition reassignment was attempted for a number of topics, which appeared to get stuck in the "in progress" state (on the order of days)
- these topics were subsequently deleted
- a rolling restart of the cluster was performed to turn on broker-to-broker SSL communication
- the SSL change has subsequently been _rolled back_ after we observed these exceptions
- the entire cluster was shut down, and nodes brought back one at a time in an attempt to clear the exception. We were able to restart the cluster, but we continue to see the exceptions

We also observed, during the same time as the exception above, the following exception on all hosts:

{code}
2017-03-15 01:44:04,572 ERROR [ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa] controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener on Controller 10]: Error while handling broker changes
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
    at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
    at kafka.controller.KafkaController$$anonfun$8$$anonfun$apply$2.apply(KafkaController.scala:436)
    at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
    at scala.collection.immutable.List.exists(List.scala:84)
    at kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:436)
    at kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:435)
    at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
    at scala.collection.mutable.HashMap
[jira] [Created] (KAFKA-4914) Partition re-assignment tool should check types before persisting state in ZooKeeper
Nick Travers created KAFKA-4914:
---
Summary: Partition re-assignment tool should check types before persisting state in ZooKeeper
Key: KAFKA-4914
URL: https://issues.apache.org/jira/browse/KAFKA-4914
Project: Kafka
Issue Type: Improvement
Components: admin
Affects Versions: 0.10.1.1
Reporter: Nick Travers
Priority: Minor

The partition-reassignment tool currently allows non-type-safe information to be persisted into ZooKeeper, which can result in a ClassCastException at runtime for brokers.

Specifically, this occurred when the broker assignment field was a List of Strings, instead of a List of Integers.

{code}
2017-03-15 01:44:04,572 ERROR [ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa] controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener on Controller 10]: Error while handling broker changes
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
    at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
    at kafka.controller.KafkaController$$anonfun$8$$anonfun$apply$2.apply(KafkaController.scala:436)
    at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
    at scala.collection.immutable.List.exists(List.scala:84)
    at kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:436)
    at kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:435)
    at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
    at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
    at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
    at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
    at kafka.controller.KafkaController.onBrokerStartup(KafkaController.scala:435)
    at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:374)
    at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358)
    at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
    at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:357)
    at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356)
    at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
    at kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:355)
    at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:843)
    at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
{code}

-- This message was sent by Atlassian JIRA (v6.3.15#6346)
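To illustrate the kind of check being proposed, here is a hypothetical validation sketch (not the actual tool code) that rejects a reassignment plan whose replicas entries are not integers; Jackson is used here purely for illustration:

{code}
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ReassignmentJsonCheck {
    // Returns true only if every replica in every partition entry is an integer.
    // A plan like {"replicas": ["1", "2", "3"]} (strings) would be rejected here
    // instead of being written to ZooKeeper and failing later in the controller.
    static boolean replicasAreIntegers(String reassignmentJson) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(reassignmentJson);
        for (JsonNode partition : root.path("partitions")) {
            for (JsonNode replica : partition.path("replicas")) {
                if (!replica.isInt()) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        String good = "{\"version\":1,\"partitions\":[{\"topic\":\"foo\",\"partition\":0,\"replicas\":[1,2,3]}]}";
        String bad  = "{\"version\":1,\"partitions\":[{\"topic\":\"foo\",\"partition\":0,\"replicas\":[\"1\",\"2\",\"3\"]}]}";
        System.out.println(replicasAreIntegers(good)); // true
        System.out.println(replicasAreIntegers(bad));  // false
    }
}
{code}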
[jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics
[ https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929381#comment-15929381 ]

Nick Travers commented on KAFKA-4900:
-

Circling back on this, now that we (seemingly) have a root cause on our end for the issues we saw.

It appears that while running the partition re-assignment tool, we persisted state in ZK that violated the type constraints on certain items in the re-assignment json (specifically, a List of Strings instead of a List of Integers for the broker assignment). I've created KAFKA-4914 to improve this.

This initially threw a ClassCastException, as you'd expect, on each broker at startup time. I assume this then led to a situation where metrics had already been registered for the broker. Part of the controller (re-)election process is registering metrics for the controller. If there are already metrics registered, an IllegalArgumentException is thrown, crashing the (re-)election method, which triggers the same issue on the next elected broker, and so on. This resulted in the whole cluster falling into a sustained controller re-election spin.

I'm unsure exactly how the first exception triggered the second, but the ordering is as stated: ClassCastException first, and eventually an IllegalArgumentException.

Could one potential solution here be to guard against re-registering metrics by first checking to see if the metrics are registered and then taking appropriate action (i.e. creating or overwriting the metric)? It feels a little heavy-handed to throw an exception deep within the controller election process due to a metrics issue. (A rough sketch of this idea follows the quoted description below.)

> Brokers stuck in controller re-election loop after failing to register metrics
> --
>
> Key: KAFKA-4900
> URL: https://issues.apache.org/jira/browse/KAFKA-4900
> Project: Kafka
> Issue Type: Bug
> Components: controller, core
> Affects Versions: 0.10.1.1
> Reporter: Nick Travers
>
> We hit this today in one of our three-node staging clusters. The exception
> continues to occur on all three nodes.
> {code}
> 2017-03-15 02:17:30,677 ERROR
> [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa]
> server.ZookeeperLeaderElector - Error while electing or becoming leader on
> broker 9
> java.lang.IllegalArgumentException: A metric named 'MetricName
> [name=connection-close-rate,
> group=controller-channel-metrics,description=Connections closed per second in
> the window., tags={broker-id=10}]' already exists, can't register another one.
> at > org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380) > at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179) > at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164) > at > org.apache.kafka.common.network.Selector$SelectorMetrics.(Selector.java:617) > at org.apache.kafka.common.network.Selector.(Selector.java:138) > at > kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101) > at > kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45) > at > kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45) > at scala.collection.immutable.Set$Set3.foreach(Set.scala:163) > at > kafka.controller.ControllerChannelManager.(ControllerChannelManager.scala:45) > at > kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814) > at > kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742) > at > kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334) > at > kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167) > at > kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141) >
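The guard suggested in the comment above might look something like the following standalone sketch. This is not the actual org.apache.kafka.common.metrics.Metrics code, just an illustration of "check, then create or reuse" instead of unconditionally registering and throwing:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MetricRegistrySketch {
    // Stand-in for a metric; the real MetricName/KafkaMetric types are richer.
    static final class Metric {
        final String name;
        Metric(String name) { this.name = name; }
    }

    private final Map<String, Metric> metrics = new ConcurrentHashMap<>();

    // Current behavior (simplified): a second registration with the same name
    // throws, which is what aborted the controller election described above.
    Metric registerStrict(String name) {
        Metric fresh = new Metric(name);
        Metric previous = metrics.putIfAbsent(name, fresh);
        if (previous != null) {
            throw new IllegalArgumentException("A metric named '" + name + "' already exists");
        }
        return fresh;
    }

    // Suggested alternative: reuse (or replace) an existing metric rather than
    // failing, so leftover state from an earlier failed election cannot wedge
    // the next election attempt.
    Metric registerOrReuse(String name) {
        return metrics.computeIfAbsent(name, Metric::new);
    }
}
{code}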
[jira] [Updated] (KAFKA-4914) Partition re-assignment tool should check types before persisting state in ZooKeeper
[ https://issues.apache.org/jira/browse/KAFKA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Travers updated KAFKA-4914: Status: Patch Available (was: Open) > Partition re-assignment tool should check types before persisting state in > ZooKeeper > > > Key: KAFKA-4914 > URL: https://issues.apache.org/jira/browse/KAFKA-4914 > Project: Kafka > Issue Type: Improvement > Components: admin >Affects Versions: 0.10.1.1 > Reporter: Nick Travers >Priority: Minor > > The partition-reassignment too currently allows non-type-safe information to > be persisted into ZooKeeper, which can result in a ClassCastException at > runtime for brokers. > Specifically, this occurred when the broker assignment field was a List of > Strings, instead of a List of Integers. > {code} > 2017-03-15 01:44:04,572 ERROR > [ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa] > controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener > on Controller 10]: Error while handling broker changes > java.lang.ClassCastException: java.lang.String cannot be cast to > java.lang.Integer > at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101) > at > kafka.controller.KafkaController$$anonfun$8$$anonfun$apply$2.apply(KafkaController.scala:436) > at > scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93) > at scala.collection.immutable.List.exists(List.scala:84) > at > kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:436) > at > kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:435) > at > scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) > at > scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) > at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:259) > at scala.collection.AbstractTraversable.filter(Traversable.scala:104) > at > kafka.controller.KafkaController.onBrokerStartup(KafkaController.scala:435) > at > kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:374) > at > kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358) > at > kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358) > at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33) > at > kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:357) > at > kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356) > at > kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234) > at > kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:355) > at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:843) 
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics
[ https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15942542#comment-15942542 ] Nick Travers commented on KAFKA-4900: - [~onurkaraman] - I like the sound of this theory. While I don't think we hit _exactly_ the same concurrency bug heading _into_ the controller re-election loop, I think that what you outline in terms of hitting the metrics issue sounds reasonable. In our case, we hit a {{ClassCastException}} on controller (re-)election, which most likely resulted in the metrics (amongst other things) not being cleaned up. This also makes sense with what we observed, as each broker hit the same {{ClassCastException}} (as a result of KAFKA-4914), which left each broker with controller metrics that were not cleaned up. It sounds like a potential solution here is to fix both of the underlying issues that caused the brokers to leave "dirty state" lying around when the controller election process failed (partially tracked by KAFKA-4447 and KAFKA-4914), as well as the issue of the controller not cleaning up the state itself. > Brokers stuck in controller re-election loop after failing to register metrics > -- > > Key: KAFKA-4900 > URL: https://issues.apache.org/jira/browse/KAFKA-4900 > Project: Kafka > Issue Type: Bug > Components: controller, core >Affects Versions: 0.10.1.1 >Reporter: Nick Travers > > We hit this today in one of out three node staging clusters. The exception > continues to occur on all three nodes. > {code} > 2017-03-15 02:17:30,677 ERROR > [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] > server.ZookeeperLeaderElector - Error while electing or becoming leader on > broker 9 > java.lang.IllegalArgumentException: A metric named 'MetricName > [name=connection-close-rate, > group=controller-channel-metrics,description=Connections closed per second in > the window., tags={broker-id=10}]' already exists, can't register another one. 
> at > org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380) > at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179) > at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164) > at > org.apache.kafka.common.network.Selector$SelectorMetrics.(Selector.java:617) > at org.apache.kafka.common.network.Selector.(Selector.java:138) > at > kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101) > at > kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45) > at > kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45) > at scala.collection.immutable.Set$Set3.foreach(Set.scala:163) > at > kafka.controller.ControllerChannelManager.(ControllerChannelManager.scala:45) > at > kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814) > at > kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742) > at > kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334) > at > kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167) > at > kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141) > at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824) > at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) > {code} > We observe a tight loop of controller (re-)election, i.e. one node hits this > exception, and leadership transitions to the next, which then hits the > exception, ad infinitum. > Producers and consumers appear to be connecting ok, and are able to produce > and consume messages. > Relevant data points: > - prior to this cluster restart, a partition reassig
[jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics
[ https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961885#comment-15961885 ] Nick Travers commented on KAFKA-4900: - [~onurkaraman] - I think so. This ticket seems to deal more with the issue of the controller re-election failing due to metrics already existing. > Brokers stuck in controller re-election loop after failing to register metrics > -- > > Key: KAFKA-4900 > URL: https://issues.apache.org/jira/browse/KAFKA-4900 > Project: Kafka > Issue Type: Bug > Components: controller, core >Affects Versions: 0.10.1.1 >Reporter: Nick Travers > > We hit this today in one of out three node staging clusters. The exception > continues to occur on all three nodes. > {code} > 2017-03-15 02:17:30,677 ERROR > [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] > server.ZookeeperLeaderElector - Error while electing or becoming leader on > broker 9 > java.lang.IllegalArgumentException: A metric named 'MetricName > [name=connection-close-rate, > group=controller-channel-metrics,description=Connections closed per second in > the window., tags={broker-id=10}]' already exists, can't register another one. > at > org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380) > at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179) > at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164) > at > org.apache.kafka.common.network.Selector$SelectorMetrics.(Selector.java:617) > at org.apache.kafka.common.network.Selector.(Selector.java:138) > at > kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101) > at > kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45) > at > kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45) > at scala.collection.immutable.Set$Set3.foreach(Set.scala:163) > at > kafka.controller.ControllerChannelManager.(ControllerChannelManager.scala:45) > at > kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814) > at > kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742) > at > kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334) > at > kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167) > at > kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141) > at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824) > at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) > {code} > We observe a tight loop of controller (re-)election, i.e. one node hits this > exception, and leadership transitions to the next, which then hits the > exception, ad infinitum. > Producers and consumers appear to be connecting ok, and are able to produce > and consume messages. 
> Relevant data points: > - prior to this cluster restart, a partition reassignment was attempted for > a number of topics, which appeared to get stuck in the "in progress" state > (on the order of days) > - these topics were subsequently deleted > - a rolling restart of the cluster was performed was to turn on > broker-to-broker SSL communication > - the SSL change has subsequently been _rolled back_ after we observed these > exceptions > - the entire cluster was shut down, and nodes brought back one at a time in > an attempt to clear the exception. We were able to restart the cluster, but > we continue to see the exceptions > We also observed, during the same time as the exception above, the following > exception on all hosts: > {code} > 2017-03-15 01:44:04,572 ERROR > [ZkClient-EventThrea
[jira] [Created] (KAFKA-4681) Disallow reassign-partitions script to assign partitions to non-existent brokers
Nick Travers created KAFKA-4681:
---
Summary: Disallow reassign-partitions script to assign partitions to non-existent brokers
Key: KAFKA-4681
URL: https://issues.apache.org/jira/browse/KAFKA-4681
Project: Kafka
Issue Type: Improvement
Reporter: Nick Travers
Priority: Minor

When running the {{kafka-reassign-partitions.sh}} script, it is possible for partitions to be accidentally assigned to brokers that do not exist. This results in partitions with an indefinitely reduced ISR set, as the partition can never be replicated to the non-existent broker.

Our workaround was to add a new broker with the "bogus" broker ID, reassign the partitions to other brokers, and then decommission the bogus broker. This is not ideal due to the increased operational burden of adding and removing brokers, in addition to manually moving partitions around.

Suggest patching the script to disallow assignment of partitions to brokers that either do not exist, or are dead and not participating in the ISR of other partitions.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
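A rough sketch of the proposed validation, purely illustrative; in the real tool the set of live broker IDs would come from the broker registrations in ZooKeeper rather than a hard-coded set:

{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReassignmentTargetCheck {
    // Fail fast if the proposed replica list references a broker ID that is not
    // currently registered, instead of writing the plan and leaving the
    // partition under-replicated indefinitely.
    static void validateTargets(List<Integer> proposedReplicas, Set<Integer> liveBrokerIds) {
        Set<Integer> unknown = new HashSet<>(proposedReplicas);
        unknown.removeAll(liveBrokerIds);
        if (!unknown.isEmpty()) {
            throw new IllegalArgumentException(
                "Reassignment references non-existent broker ids: " + unknown);
        }
    }

    public static void main(String[] args) {
        Set<Integer> live = new HashSet<>(Arrays.asList(9, 10, 11));
        validateTargets(Arrays.asList(9, 10), live);      // ok
        validateTargets(Arrays.asList(9, 10, 12), live);  // throws: broker 12 does not exist
    }
}
{code}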
[jira] [Commented] (KAFKA-4698) `-target:jvm-1.7` in scalaCompileOptions seems to cause java.lang.VerifyError when loading kafka/client/ClientUtils
[ https://issues.apache.org/jira/browse/KAFKA-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838978#comment-15838978 ] Nick Travers commented on KAFKA-4698: - [~ijuma] [~hachikuji] - any chance we can get this into 0.10.1.2 (if that's going to be a release) as a bugfix / hotfix? > `-target:jvm-1.7` in scalaCompileOptions seems to cause java.lang.VerifyError > when loading kafka/client/ClientUtils > > > Key: KAFKA-4698 > URL: https://issues.apache.org/jira/browse/KAFKA-4698 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.10.1.1 >Reporter: Chris Heisterkamp >Assignee: Ismael Juma > Fix For: 0.10.2.0 > > > When we updated to kafka 2.11-0.10.1.1 from 2.11-0.8.2.2 we started getting > the following error > {noformat} > java.lang.VerifyError: Stack map does not match the one at exception handler > 190 > Exception Details: > Location: > > kafka/client/ClientUtils$.fetchTopicMetadata(Lscala/collection/Set;Lscala/collection/Seq;Lkafka/producer/ProducerConfig;I)Lkafka/api/TopicMetadataResponse; > @190: astore > Reason: > Type top (current frame, locals[12]) is not assignable to > 'kafka/producer/SyncProducer' (stack map, locals[12]) > Current Frame: > bci: @63 > flags: { } > locals: { 'kafka/client/ClientUtils$', 'scala/collection/Set', > 'scala/collection/Seq', 'kafka/producer/ProducerConfig', integer, integer, > 'scala/runtime/IntRef', 'kafka/api/TopicMetadataRequest', > 'kafka/api/TopicMetadataResponse', 'java/lang/Throwable', > 'scala/collection/Seq', 'kafka/producer/SyncProducer' } > stack: { 'java/lang/Throwable' } > Stackmap Frame: > bci: @190 > flags: { } > locals: { 'kafka/client/ClientUtils$', 'scala/collection/Set', > 'scala/collection/Seq', 'kafka/producer/ProducerConfig', integer, integer, > 'scala/runtime/IntRef', 'kafka/api/TopicMetadataRequest', > 'kafka/api/TopicMetadataResponse', 'java/lang/Throwable', > 'scala/collection/Seq', top, 'kafka/producer/SyncProducer' } > stack: { 'java/lang/Throwable' } > Bytecode: > 0x000: 0336 0503 b800 aa3a 06bb 00ac 59b2 00b1 > 0x010: b600 b515 042d b600 ba2b b900 c001 00b7 > 0x020: 00c3 3a07 013a 0801 3a09 b200 c82c b200 > 0x030: cdb6 00d1 b600 d5c0 00d7 3a0a a700 353a > 0x040: 0b2a bb00 0b59 2b15 0419 0619 0ab7 00de > 0x050: bb00 0d59 190b b700 e1b6 00e3 190b 3a09 > 0x060: 1906 1906 b400 e704 60b5 00e7 190c b600 > 0x070: ea19 06b4 00e7 190a b900 ee01 00a2 0057 > 0x080: 1505 9a00 52b2 00f3 2d19 0a19 06b4 00e7 > 0x090: b900 f702 00c0 00f9 b600 fd3a 0c2a bb00 > 0x0a0: 0f59 2b15 0419 0619 0ab7 00fe b601 0019 > 0x0b0: 0c19 07b6 0104 3a08 0436 05a7 ffa5 3a0d > 0x0c0: 1906 1906 b400 e704 60b5 00e7 190c b600 > 0x0d0: ea19 0dbf 1505 9900 122a bb00 1159 2bb7 > 0x0e0: 0107 b601 0919 08b0 bb01 0b59 bb01 0d59 > 0x0f0: b201 1213 0114 b601 18b7 011a b201 1205 > 0x100: bd00 0459 032b 5359 0419 0a53 b601 1eb6 > 0x110: 0122 1909 b701 25bf > Exception Handler Table: > bci [175, 190] => handler: 63 > bci [175, 190] => handler: 190 > bci [63, 96] => handler: 190 > Stackmap Table: > > full_frame(@63,{Object[#2],Object[#188],Object[#215],Object[#183],Integer,Integer,Object[#166],Object[#172],Object[#217],Object[#72],Object[#215],Object[#219]},{Object[#72]}) > > full_frame(@96,{Object[#2],Object[#188],Object[#215],Object[#183],Integer,Integer,Object[#166],Object[#172],Object[#217],Object[#72],Object[#215],Top,Object[#219]},{}) > chop_frame(@113,2) > > 
full_frame(@190,{Object[#2],Object[#188],Object[#215],Object[#183],Integer,Integer,Object[#166],Object[#172],Object[#217],Object[#72],Object[#215],Top,Object[#219]},{Object[#72]}) > chop_frame(@212,2) > same_frame(@232) > > at > kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:65) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63) > {noformat} > Which looks like it's coming from the change in > https://issues.apache.org/jira/browse/KAFKA-3375. The suggested fix was to > rebuild the
[jira] [Created] (KAFKA-5236) Regression in on-disk log size when using Snappy compression with 0.8.2 log message format
Nick Travers created KAFKA-5236:
---
Summary: Regression in on-disk log size when using Snappy compression with 0.8.2 log message format
Key: KAFKA-5236
URL: https://issues.apache.org/jira/browse/KAFKA-5236
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 0.10.2.1
Reporter: Nick Travers

We recently upgraded the brokers in our production environments from 0.10.1.1 to 0.10.2.1 and we've noticed a sizable regression in the on-disk .log file size. For some deployments the increase was as much as 50%.

We run our brokers with the 0.8.2 log message format version. The majority of our message volume comes from 0.10.x Java clients sending messages encoded with the Snappy codec.

Some initial testing only shows a regression between the two versions when using Snappy compression with a log message format of 0.8.2. I also tested 0.10.x log message formats as well as Gzip compression. The log sizes do not differ in this case, so the issue seems confined to the 0.8.2 message format and Snappy compression.

A git-bisect led me to this commit, which modified the server-side implementation of `Record`:
https://github.com/apache/kafka/commit/67f1e5b91bf073151ff57d5d656693e385726697

Here's the PR, which has more context:
https://github.com/apache/kafka/pull/2140

Here is a link to the test I used to reproduce this issue:
https://github.com/nicktrav/kafka/commit/68e8db4fa525e173651ac740edb270b0d90b8818

cc: [~hachikuji] [~junrao] [~ijuma] [~guozhang] (tagged on original PR)

-- This message was sent by Atlassian JIRA (v6.3.15#6346)
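A regression like this can be confirmed by comparing the total size of the .log segment files on disk for the same input data under each broker version; a rough sketch of that kind of measurement (the log directory path is a placeholder):

{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LogDirSize {
    // Sums the sizes of all .log segment files under a partition directory,
    // e.g. <log.dirs>/my-topic-0. Run against a 0.10.1.1 and a 0.10.2.1 broker
    // fed the same Snappy-compressed input to compare the on-disk footprint.
    static long totalLogBytes(Path partitionDir) throws IOException {
        try (Stream<Path> files = Files.list(partitionDir)) {
            return files.filter(p -> p.toString().endsWith(".log"))
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "/tmp/kafka-logs/test-topic-0");
        System.out.println(totalLogBytes(dir) + " bytes");
    }
}
{code}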
[jira] [Commented] (KAFKA-5236) Regression in on-disk log size when using Snappy compression with 0.8.2 log message format
[ https://issues.apache.org/jira/browse/KAFKA-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035404#comment-16035404 ] Nick Travers commented on KAFKA-5236: - Thanks for the patch [~ijuma]! I ran the same tests I used originally and confirmed that there was no regression in the on-disk size. > Regression in on-disk log size when using Snappy compression with 0.8.2 log > message format > -- > > Key: KAFKA-5236 > URL: https://issues.apache.org/jira/browse/KAFKA-5236 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 0.10.2.1 >Reporter: Nick Travers >Assignee: Ismael Juma > Labels: regression > Fix For: 0.11.0.0 > > > We recently upgraded our brokers in our production environments from 0.10.1.1 > to 0.10.2.1 and we've noticed a sizable regression in the on-disk .log file > size. For some deployments the increase was as much as 50%. > We run our brokers with the 0.8.2 log message format version. The majority of > our message volume comes from 0.10.x Java clients sending messages encoded > with the Snappy codec. > Some initial testing only shows a regression between the two versions when > using Snappy compression with a log message format of 0.8.2. > I also tested 0.10.x log message formats as well as Gzip compression. The log > sizes do not differ in this case, so the issue seems confined to 0.8.2 > message format and Snappy compression. > A git-bisect lead me to this commit, which modified the server-side > implementation of `Record`: > https://github.com/apache/kafka/commit/67f1e5b91bf073151ff57d5d656693e385726697 > Here's the PR, which has more context: > https://github.com/apache/kafka/pull/2140 > Here is a link to the test I used to re-producer this issue: > https://github.com/nicktrav/kafka/commit/68e8db4fa525e173651ac740edb270b0d90b8818 > cc: [~hachikuji] [~junrao] [~ijuma] [~guozhang] (tagged on original PR) -- This message was sent by Atlassian JIRA (v6.3.15#6346)