Fwd: Producer acks=1, clean broker shutdown and data loss

2017-02-22 Thread Nick Travers
I didn't get any hits on this on users@, so I'm forwarding this to dev@.
Any explanation for this would be greatly appreciated!

-- Forwarded message ------
From: Nick Travers 
Date: Sat, Feb 18, 2017 at 5:04 PM
Subject: Producer acks=1, clean broker shutdown and data loss
To: us...@kafka.apache.org


Hi - I'm trying to understand the expected behavior of the scenario in
which I have a producer with `acks=1` (i.e. partition leader acks only) and
I cleanly shut down a broker (via `KafkaServer#shutdown`).

I am running my test scenario with three brokers (0.10.1.1), with a default
replication count and default partition count of three. The producer (also
0.10.1.1) is writing to a single topic.

When I shut down one broker, I observe that in some instances there is data
loss on the topic that the producer was sending to, around the instant
where the broker was shut down and the producer sees that a new leader for
one of the partitions has been re-elected.

Reading the documentation [0], I see the following for the `acks=1` config:

This will mean the leader will write the record to its local log but will
> respond without awaiting full acknowledgement from all followers. In this
> case should the leader fail immediately after acknowledging the record but
> before the followers have replicated it then the record will be lost.


While it is clear to me that in the case of a _failure_ of this broker that
messages will be lost (as they have not yet been replicated to a follower),
it isn't clear to me what the expected behavior is in the case where this
broker is told to _cleanly_ shut down.

I understand that a setup with `replicas=3`, `min.insync.replicas=2` and
`acks=-1` is much safer (and I've verified this works as expected for the same
test scenario), but I'd like to understand the semantics of the `acks=1` case
nonetheless.
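
For concreteness, here is a minimal sketch of the producer configuration for the
two cases I'm comparing (the broker address, topic name and serializers are just
placeholders, not my actual test setup):

{code}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AcksExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Case under test: leader-only acks. A record acked by the leader but not yet
        // replicated can be lost if leadership moves before the followers catch up.
        props.put(ProducerConfig.ACKS_CONFIG, "1");

        // Safer variant verified in the same scenario: acks=all (equivalent to -1),
        // combined with min.insync.replicas=2 set on the topic/broker side
        // (a broker/topic config, not a producer property).
        // props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value")); // placeholder topic
        }
    }
}
{code}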

Thanks in advance.
- nick

[0]: https://github.com/apache/kafka/blob/0.10.1.1/clients/
src/main/java/org/apache/kafka/clients/producer/ProducerConfig.java#L86


[jira] [Created] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics

2017-03-14 Thread Nick Travers (JIRA)
Nick Travers created KAFKA-4900:
---

 Summary: Brokers stuck in controller re-election loop after 
failing to register metrics
 Key: KAFKA-4900
 URL: https://issues.apache.org/jira/browse/KAFKA-4900
 Project: Kafka
  Issue Type: Bug
  Components: controller, core
Affects Versions: 0.10.1.1
Reporter: Nick Travers


We hit this today in one of our three-node staging clusters. The exception 
continues to occur on all three nodes.

{code}
2017-03-15 02:17:30,677 ERROR 
[ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] 
server.ZookeeperLeaderElector - Error while electing or becoming leader on 
broker 9
java.lang.IllegalArgumentException: A metric named 'MetricName 
[name=connection-close-rate, 
group=controller-channel-metrics,description=Connections closed per second in 
the window., tags={broker-id=10}]' already exists, can't register another one.
at 
org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
at 
org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
at 
kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
at 
kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
at 
kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
at 
kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
at 
kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
at 
kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
at 
kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
at 
kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
at 
kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
at 
kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
at 
kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
at 
kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
at 
kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) 
{code}

We observe a tight loop of controller (re-)election, i.e. one node hits this 
exception, and leadership transitions to the next, which then hits the 
exception, ad infinitum.

Producers and consumers appear to be connecting ok, and are able to produce and 
consume messages.

Relevant data points:
 - prior to this cluster restart, a repartition of a number of topics was 
attempted, which appeared to get stuck in the "in progress" state (on the order 
of days)
 - these topics were subsequently deleted
 - a rolling restart of the cluster was performed to turn on 
broker-to-broker SSL communication
 - the SSL change has subsequently been _rolled back_ after we observed these 
exceptions
 - the entire cluster was shut down, and nodes brought back one at a time in an 
attempt to clear the exception. We were able to restart the cluster, but we 
continue to see the exceptions

We also observed, during the same time as the exception above, the following 
exception on all hosts:

{code}
2017-03-15 01:44:04,572 ERROR 
[ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa] 
controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener on 
Controller 10]: Error while handling broker changes
java.lang.ClassCastException: java.lang.String cannot be cast to 
java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at 
kafka.controller.KafkaController$$anonfun$8$$anonfun$apply$2.apply(KafkaController.scala:436)
at 
scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
at scala.collection.immutable.List.exists(List.scala:84)
at 
kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:436)
at 
kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:435)
at 
scala.c

[jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics

2017-03-14 Thread Nick Travers (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925560#comment-15925560
 ] 

Nick Travers commented on KAFKA-4900:
-

[~onurkaraman] - good to know! Were you able to resolve the issue? Our cluster 
is up, but it's still in a pretty tight loop.

> Brokers stuck in controller re-election loop after failing to register metrics
> --
>
> Key: KAFKA-4900
> URL: https://issues.apache.org/jira/browse/KAFKA-4900
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, core
>Affects Versions: 0.10.1.1
>Reporter: Nick Travers
>
> We hit this today in one of our three-node staging clusters. The exception 
> continues to occur on all three nodes.
> {code}
> 2017-03-15 02:17:30,677 ERROR 
> [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] 
> server.ZookeeperLeaderElector - Error while electing or becoming leader on 
> broker 9
> java.lang.IllegalArgumentException: A metric named 'MetricName 
> [name=connection-close-rate, 
> group=controller-channel-metrics,description=Connections closed per second in 
> the window., tags={broker-id=10}]' already exists, can't register another one.
> at 
> org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
> at 
> org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
> at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
> at 
> kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
> at 
> kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
> at 
> kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
> at 
> kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
> at 
> kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
> at 
> kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
> at 
> kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
> at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) 
> {code}
> We observe a tight loop of controller (re-)election, i.e. one node hits this 
> exception, and leadership transitions to the next, which then hits the 
> exception, ad infinitum.
> Producers and consumers appear to be connecting ok, and are able to produce 
> and consume messages.
> Relevant data points:
>  - prior to this cluster restart, a repartition of a number of topics was 
> attempted, which appeared to get stuck in the "in progress" state (on the 
> order of days)
>  - these topics were subsequently deleted
>  - a rolling restart of the cluster was performed to turn on 
> broker-to-broker SSL communication
>  - the SSL change has subsequently been _rolled back_ after we observed these 
> exceptions
>  - the entire cluster was shut down, and nodes brought back one at a time in 
> an attempt to clear the exception. We were able to restart the cluster, but 
> we continue to see the exceptions
> We also observed, during the same time as the exception above, the following 
> exception on all hosts:
> {code}
> 2017-03-15 01:44:04,572 ERROR 
> [ZkClient-EventThread-36-sam

[jira] [Updated] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics

2017-03-14 Thread Nick Travers (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Travers updated KAFKA-4900:

Description: 
We hit this today in one of our three-node staging clusters. The exception 
continues to occur on all three nodes.

{code}
2017-03-15 02:17:30,677 ERROR 
[ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] 
server.ZookeeperLeaderElector - Error while electing or becoming leader on 
broker 9
java.lang.IllegalArgumentException: A metric named 'MetricName 
[name=connection-close-rate, 
group=controller-channel-metrics,description=Connections closed per second in 
the window., tags={broker-id=10}]' already exists, can't register another one.
at 
org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
at 
org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
at 
kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
at 
kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
at 
kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
at 
kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
at 
kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
at 
kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
at 
kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
at 
kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
at 
kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
at 
kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
at 
kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
at 
kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
at 
kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) 
{code}

We observe a tight loop of controller (re-)election, i.e. one node hits this 
exception, and leadership transitions to the next, which then hits the 
exception, ad infinitum.

Producers and consumers appear to be connecting ok, and are able to produce and 
consume messages.

Relevant data points:
 - prior to this cluster restart, a partition reassignment was attempted for a 
number of topics, which appeared to get stuck in the "in progress" state (on 
the order of days)
 - these topics were subsequently deleted
 - a rolling restart of the cluster was performed to turn on 
broker-to-broker SSL communication
 - the SSL change has subsequently been _rolled back_ after we observed these 
exceptions
 - the entire cluster was shut down, and nodes brought back one at a time in an 
attempt to clear the exception. We were able to restart the cluster, but we 
continue to see the exceptions

We also observed, during the same time as the exception above, the following 
exception on all hosts:

{code}
2017-03-15 01:44:04,572 ERROR 
[ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa] 
controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener on 
Controller 10]: Error while handling broker changes
java.lang.ClassCastException: java.lang.String cannot be cast to 
java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at 
kafka.controller.KafkaController$$anonfun$8$$anonfun$apply$2.apply(KafkaController.scala:436)
at 
scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
at scala.collection.immutable.List.exists(List.scala:84)
at 
kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:436)
at 
kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:435)
at 
scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashMap

[jira] [Created] (KAFKA-4914) Partition re-assignment tool should check types before persisting state in ZooKeeper

2017-03-16 Thread Nick Travers (JIRA)
Nick Travers created KAFKA-4914:
---

 Summary: Partition re-assignment tool should check types before 
persisting state in ZooKeeper
 Key: KAFKA-4914
 URL: https://issues.apache.org/jira/browse/KAFKA-4914
 Project: Kafka
  Issue Type: Improvement
  Components: admin
Affects Versions: 0.10.1.1
Reporter: Nick Travers
Priority: Minor


The partition-reassignment tool currently allows non-type-safe information to be 
persisted into ZooKeeper, which can result in a ClassCastException at runtime 
for brokers.

Specifically, this occurred when the broker assignment field was a List of 
Strings, instead of a List of Integers.

{code}
2017-03-15 01:44:04,572 ERROR 
[ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa] 
controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener on 
Controller 10]: Error while handling broker changes
java.lang.ClassCastException: java.lang.String cannot be cast to 
java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at 
kafka.controller.KafkaController$$anonfun$8$$anonfun$apply$2.apply(KafkaController.scala:436)
at 
scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
at scala.collection.immutable.List.exists(List.scala:84)
at 
kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:436)
at 
kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:435)
at 
scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at 
scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
at 
kafka.controller.KafkaController.onBrokerStartup(KafkaController.scala:435)
at 
kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:374)
at 
kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358)
at 
kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at 
kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:357)
at 
kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356)
at 
kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
at 
kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:355)
at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:843)
at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
{code}
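
A sketch of the kind of check the tool could perform before writing the reassignment
to ZooKeeper (the Map/List shapes below are stand-ins for whatever the tool's JSON
parser actually returns, not Kafka's real types):

{code}
import java.util.List;
import java.util.Map;

// Sketch only: validate that every replica id in the parsed reassignment JSON is an
// integer before anything is persisted to ZooKeeper.
public class ReassignmentTypeCheck {
    @SuppressWarnings("unchecked")
    public static void validate(List<Map<String, Object>> partitions) {
        for (Map<String, Object> partition : partitions) {
            List<Object> replicas = (List<Object>) partition.get("replicas");
            for (Object replica : replicas) {
                if (!(replica instanceof Integer)) {
                    throw new IllegalArgumentException(
                            "Replica id " + replica + " for topic " + partition.get("topic")
                                    + ", partition " + partition.get("partition")
                                    + " is not an integer");
                }
            }
        }
    }
}
{code}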



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics

2017-03-16 Thread Nick Travers (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929381#comment-15929381
 ] 

Nick Travers commented on KAFKA-4900:
-

Circling back on this, now that we (seemingly) have a root cause on our end for 
the issues we saw.

It appears that while running the partition re-assignment tool, we persisted 
state in ZK that violated the type constraints on certain items in the 
re-assignment json (specifically, a List of Strings instead of a List of 
Integers for the broker assignment). I've created KAFKA-4914 to improve this.

This initially threw a ClassCastException, as you'd expect, on each broker at 
startup time.

I assume this then led to a situation where metrics had already been 
registered for the broker. Part of the controller (re-)election process is 
registering metrics for the controller. If there are already metrics 
registered, an IllegalArgumentException is thrown, crashing the (re-)election 
method, which triggers the same issue on the next elected broker, and so on.

I'm unsure exactly how the first exception triggered the second, but the 
ordering is as stated. ClassCastException first, and eventually an 
IllegalArgumentException.

Could one potential solution here be to guard against re-registering metrics by 
first checking to see if the metrics are registered and then taking the 
appropriate action (i.e. create or overwrite the metric)? It feels a little 
heavy-handed to throw an exception deep within the controller election process 
due to a metrics issue. This resulted in the whole cluster falling into a 
sustained controller re-election spin.
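
To illustrate the kind of guard I mean (purely a sketch with made-up class names, not
the actual {{Metrics}} API):

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch only: "check, then create or overwrite" rather than throwing on a duplicate
// registration. SafeMetricRegistry and Metric are illustrative names.
public class SafeMetricRegistry {
    public interface Metric {
        double value();
    }

    private final ConcurrentMap<String, Metric> metrics = new ConcurrentHashMap<>();

    /** Registers the metric, replacing any existing metric with the same name. */
    public void registerOrReplace(String name, Metric metric) {
        metrics.put(name, metric);
    }

    /** Registers only if absent; returns false instead of throwing on a duplicate. */
    public boolean registerIfAbsent(String name, Metric metric) {
        return metrics.putIfAbsent(name, metric) == null;
    }
}
{code}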

> Brokers stuck in controller re-election loop after failing to register metrics
> --
>
> Key: KAFKA-4900
> URL: https://issues.apache.org/jira/browse/KAFKA-4900
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, core
>Affects Versions: 0.10.1.1
>Reporter: Nick Travers
>
> We hit this today in one of our three-node staging clusters. The exception 
> continues to occur on all three nodes.
> {code}
> 2017-03-15 02:17:30,677 ERROR 
> [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] 
> server.ZookeeperLeaderElector - Error while electing or becoming leader on 
> broker 9
> java.lang.IllegalArgumentException: A metric named 'MetricName 
> [name=connection-close-rate, 
> group=controller-channel-metrics,description=Connections closed per second in 
> the window., tags={broker-id=10}]' already exists, can't register another one.
> at 
> org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
> at 
> org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
> at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
> at 
> kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
> at 
> kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
> at 
> kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
> at 
> kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
> at 
> kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
> at 
> kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
> at 
> kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
> 

[jira] [Updated] (KAFKA-4914) Partition re-assignment tool should check types before persisting state in ZooKeeper

2017-03-19 Thread Nick Travers (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Travers updated KAFKA-4914:

Status: Patch Available  (was: Open)

> Partition re-assignment tool should check types before persisting state in 
> ZooKeeper
> 
>
> Key: KAFKA-4914
> URL: https://issues.apache.org/jira/browse/KAFKA-4914
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 0.10.1.1
>    Reporter: Nick Travers
>Priority: Minor
>
> The partition-reassignment tool currently allows non-type-safe information to 
> be persisted into ZooKeeper, which can result in a ClassCastException at 
> runtime for brokers.
> Specifically, this occurred when the broker assignment field was a List of 
> Strings, instead of a List of Integers.
> {code}
> 2017-03-15 01:44:04,572 ERROR 
> [ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa] 
> controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener 
> on Controller 10]: Error while handling broker changes
> java.lang.ClassCastException: java.lang.String cannot be cast to 
> java.lang.Integer
> at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
> at 
> kafka.controller.KafkaController$$anonfun$8$$anonfun$apply$2.apply(KafkaController.scala:436)
> at 
> scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
> at scala.collection.immutable.List.exists(List.scala:84)
> at 
> kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:436)
> at 
> kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:435)
> at 
> scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
> at 
> scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
> at 
> scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
> at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
> at 
> kafka.controller.KafkaController.onBrokerStartup(KafkaController.scala:435)
> at 
> kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:374)
> at 
> kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358)
> at 
> kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358)
> at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
> at 
> kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:357)
> at 
> kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356)
> at 
> kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at 
> kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:355)
> at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:843)
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics

2017-03-26 Thread Nick Travers (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15942542#comment-15942542
 ] 

Nick Travers commented on KAFKA-4900:
-

[~onurkaraman] - I like the sound of this theory. While I don't think we hit 
_exactly_ the same concurrency bug heading _into_ the controller re-election 
loop, I think that what you outline in terms of hitting the metrics issue 
sounds reasonable. In our case, we hit a {{ClassCastException}} on controller 
(re-)election, which most likely resulted in the metrics (amongst other things) 
not being cleaned up.

This also makes sense with what we observed, as each broker hit the same 
{{ClassCastException}} (as a result of KAFKA-4914), which left each broker with 
controller metrics that were not cleaned up.

It sounds like a potential solution here is to fix both of the underlying 
issues that caused the brokers to leave "dirty state" lying around when the 
controller election process failed (partially tracked by KAFKA-4447 and 
KAFKA-4914), as well as the issue of the controller not cleaning up the state 
itself.
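
As a rough sketch of that last part (hypothetical names, not the actual
{{KafkaController}} code), the failover path could release partially-registered state
when election fails:

{code}
// Hypothetical shape of the fix: if becoming the controller fails part-way through,
// release whatever was already registered so the next election attempt does not trip
// over stale metrics or channel managers.
public class ControllerFailoverSketch {
    interface ControllerResources {
        void registerMetrics();
        void startChannelManager();
        void cleanup(); // unregister metrics, shut down the channel manager, etc.
    }

    static void onControllerFailover(ControllerResources resources) {
        try {
            resources.registerMetrics();
            resources.startChannelManager();
        } catch (RuntimeException e) {
            // Without this, a failed attempt leaves metrics registered and the next
            // attempt dies with "metric ... already exists", as seen in this ticket.
            resources.cleanup();
            throw e;
        }
    }
}
{code}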

> Brokers stuck in controller re-election loop after failing to register metrics
> --
>
> Key: KAFKA-4900
> URL: https://issues.apache.org/jira/browse/KAFKA-4900
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, core
>Affects Versions: 0.10.1.1
>Reporter: Nick Travers
>
> We hit this today in one of our three-node staging clusters. The exception 
> continues to occur on all three nodes.
> {code}
> 2017-03-15 02:17:30,677 ERROR 
> [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] 
> server.ZookeeperLeaderElector - Error while electing or becoming leader on 
> broker 9
> java.lang.IllegalArgumentException: A metric named 'MetricName 
> [name=connection-close-rate, 
> group=controller-channel-metrics,description=Connections closed per second in 
> the window., tags={broker-id=10}]' already exists, can't register another one.
> at 
> org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
> at 
> org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
> at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
> at 
> kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
> at 
> kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
> at 
> kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
> at 
> kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
> at 
> kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
> at 
> kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
> at 
> kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
> at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) 
> {code}
> We observe a tight loop of controller (re-)election, i.e. one node hits this 
> exception, and leadership transitions to the next, which then hits the 
> exception, ad infinitum.
> Producers and consumers appear to be connecting ok, and are able to produce 
> and consume messages.
> Relevant data points:
>  - prior to this cluster restart, a partition reassig

[jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics

2017-04-08 Thread Nick Travers (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961885#comment-15961885
 ] 

Nick Travers commented on KAFKA-4900:
-

[~onurkaraman] - I think so. This ticket seems to deal more with the issue of 
the controller re-election failing due to metrics already existing.

> Brokers stuck in controller re-election loop after failing to register metrics
> --
>
> Key: KAFKA-4900
> URL: https://issues.apache.org/jira/browse/KAFKA-4900
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, core
>Affects Versions: 0.10.1.1
>Reporter: Nick Travers
>
> We hit this today in one of our three-node staging clusters. The exception 
> continues to occur on all three nodes.
> {code}
> 2017-03-15 02:17:30,677 ERROR 
> [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] 
> server.ZookeeperLeaderElector - Error while electing or becoming leader on 
> broker 9
> java.lang.IllegalArgumentException: A metric named 'MetricName 
> [name=connection-close-rate, 
> group=controller-channel-metrics,description=Connections closed per second in 
> the window., tags={broker-id=10}]' already exists, can't register another one.
> at 
> org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
> at 
> org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
> at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
> at 
> kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
> at 
> kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
> at 
> kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
> at 
> kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
> at 
> kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
> at 
> kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
> at 
> kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
> at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) 
> {code}
> We observe a tight loop of controller (re-)election, i.e. one node hits this 
> exception, and leadership transitions to the next, which then hits the 
> exception, ad infinitum.
> Producers and consumers appear to be connecting ok, and are able to produce 
> and consume messages.
> Relevant data points:
>  - prior to this cluster restart, a partition reassignment was attempted for 
> a number of topics, which appeared to get stuck in the "in progress" state 
> (on the order of days)
>  - these topics were subsequently deleted
>  - a rolling restart of the cluster was performed to turn on 
> broker-to-broker SSL communication
>  - the SSL change has subsequently been _rolled back_ after we observed these 
> exceptions
>  - the entire cluster was shut down, and nodes brought back one at a time in 
> an attempt to clear the exception. We were able to restart the cluster, but 
> we continue to see the exceptions
> We also observed, during the same time as the exception above, the following 
> exception on all hosts:
> {code}
> 2017-03-15 01:44:04,572 ERROR 
> [ZkClient-EventThrea

[jira] [Created] (KAFKA-4681) Disallow reassign-partitions script to assign partitions to non-existent brokers

2017-01-20 Thread Nick Travers (JIRA)
Nick Travers created KAFKA-4681:
---

 Summary: Disallow reassign-partitions script to assign partitions 
to non-existent brokers
 Key: KAFKA-4681
 URL: https://issues.apache.org/jira/browse/KAFKA-4681
 Project: Kafka
  Issue Type: Improvement
Reporter: Nick Travers
Priority: Minor


When running the {{kafka-reassign-partitions.sh}} script, it is possible for 
partitions to be accidentally assigned to brokers that do not exist.

This results in partitions with an indefinitely reduced ISR set, as the 
partition can never be replicated to the non-existent broker.

The workaround was to add a new broker with the "bogus" broker ID, reassign the 
partitions to other brokers, and then decommission the bogus broker. This is not 
ideal due to the increased operational burden of adding and removing brokers, in 
addition to manually moving partitions around.

Suggest patching the script to disallow assignment of partitions to brokers 
that either do not exist, or that are dead and not participating in the ISR of 
other partitions.
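
A sketch of such a pre-flight check (hypothetical names; in the real tool the live
broker IDs would come from ZooKeeper's {{/brokers/ids}} path rather than a parameter):

{code}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: reject a proposed assignment that references broker ids which are not
// currently registered in the cluster.
public class AssignmentBrokerCheck {
    public static void ensureBrokersExist(List<Integer> proposedReplicas,
                                          Set<Integer> liveBrokerIds) {
        Set<Integer> unknown = new HashSet<>(proposedReplicas);
        unknown.removeAll(liveBrokerIds);
        if (!unknown.isEmpty()) {
            throw new IllegalArgumentException(
                    "Proposed assignment references non-existent broker ids: " + unknown);
        }
    }
}
{code}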



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4698) `-target:jvm-1.7` in scalaCompileOptions seems to cause java.lang.VerifyError when loading kafka/client/ClientUtils

2017-01-25 Thread Nick Travers (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838978#comment-15838978
 ] 

Nick Travers commented on KAFKA-4698:
-

[~ijuma] [~hachikuji] - any chance we can get this into 0.10.1.2 (if that's 
going to be a release) as a bugfix / hotfix?

> `-target:jvm-1.7` in scalaCompileOptions seems to cause java.lang.VerifyError 
> when loading kafka/client/ClientUtils 
> 
>
> Key: KAFKA-4698
> URL: https://issues.apache.org/jira/browse/KAFKA-4698
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.1
>Reporter: Chris Heisterkamp
>Assignee: Ismael Juma
> Fix For: 0.10.2.0
>
>
> When we updated to kafka 2.11-0.10.1.1 from 2.11-0.8.2.2 we started getting 
> the following error 
> {noformat}
> java.lang.VerifyError: Stack map does not match the one at exception handler 
> 190
> Exception Details:
>   Location:
> 
> kafka/client/ClientUtils$.fetchTopicMetadata(Lscala/collection/Set;Lscala/collection/Seq;Lkafka/producer/ProducerConfig;I)Lkafka/api/TopicMetadataResponse;
>  @190: astore
>   Reason:
> Type top (current frame, locals[12]) is not assignable to 
> 'kafka/producer/SyncProducer' (stack map, locals[12])
>   Current Frame:
> bci: @63
> flags: { }
> locals: { 'kafka/client/ClientUtils$', 'scala/collection/Set', 
> 'scala/collection/Seq', 'kafka/producer/ProducerConfig', integer, integer, 
> 'scala/runtime/IntRef', 'kafka/api/TopicMetadataRequest', 
> 'kafka/api/TopicMetadataResponse', 'java/lang/Throwable', 
> 'scala/collection/Seq', 'kafka/producer/SyncProducer' }
> stack: { 'java/lang/Throwable' }
>   Stackmap Frame:
> bci: @190
> flags: { }
> locals: { 'kafka/client/ClientUtils$', 'scala/collection/Set', 
> 'scala/collection/Seq', 'kafka/producer/ProducerConfig', integer, integer, 
> 'scala/runtime/IntRef', 'kafka/api/TopicMetadataRequest', 
> 'kafka/api/TopicMetadataResponse', 'java/lang/Throwable', 
> 'scala/collection/Seq', top, 'kafka/producer/SyncProducer' }
> stack: { 'java/lang/Throwable' }
>   Bytecode:
> 0x000: 0336 0503 b800 aa3a 06bb 00ac 59b2 00b1
> 0x010: b600 b515 042d b600 ba2b b900 c001 00b7
> 0x020: 00c3 3a07 013a 0801 3a09 b200 c82c b200
> 0x030: cdb6 00d1 b600 d5c0 00d7 3a0a a700 353a
> 0x040: 0b2a bb00 0b59 2b15 0419 0619 0ab7 00de
> 0x050: bb00 0d59 190b b700 e1b6 00e3 190b 3a09
> 0x060: 1906 1906 b400 e704 60b5 00e7 190c b600
> 0x070: ea19 06b4 00e7 190a b900 ee01 00a2 0057
> 0x080: 1505 9a00 52b2 00f3 2d19 0a19 06b4 00e7
> 0x090: b900 f702 00c0 00f9 b600 fd3a 0c2a bb00
> 0x0a0: 0f59 2b15 0419 0619 0ab7 00fe b601 0019
> 0x0b0: 0c19 07b6 0104 3a08 0436 05a7 ffa5 3a0d
> 0x0c0: 1906 1906 b400 e704 60b5 00e7 190c b600
> 0x0d0: ea19 0dbf 1505 9900 122a bb00 1159 2bb7
> 0x0e0: 0107 b601 0919 08b0 bb01 0b59 bb01 0d59
> 0x0f0: b201 1213 0114 b601 18b7 011a b201 1205
> 0x100: bd00 0459 032b 5359 0419 0a53 b601 1eb6
> 0x110: 0122 1909 b701 25bf
>   Exception Handler Table:
> bci [175, 190] => handler: 63
> bci [175, 190] => handler: 190
> bci [63, 96] => handler: 190
>   Stackmap Table:
> 
> full_frame(@63,{Object[#2],Object[#188],Object[#215],Object[#183],Integer,Integer,Object[#166],Object[#172],Object[#217],Object[#72],Object[#215],Object[#219]},{Object[#72]})
> 
> full_frame(@96,{Object[#2],Object[#188],Object[#215],Object[#183],Integer,Integer,Object[#166],Object[#172],Object[#217],Object[#72],Object[#215],Top,Object[#219]},{})
> chop_frame(@113,2)
> 
> full_frame(@190,{Object[#2],Object[#188],Object[#215],Object[#183],Integer,Integer,Object[#166],Object[#172],Object[#217],Object[#72],Object[#215],Top,Object[#219]},{Object[#72]})
> chop_frame(@212,2)
> same_frame(@232)
>  
>   at 
> kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:65)
>   at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
> {noformat}
> Which looks like it's coming from the change in 
> https://issues.apache.org/jira/browse/KAFKA-3375.  The suggested fix was to 
> rebuild the

[jira] [Created] (KAFKA-5236) Regression in on-disk log size when using Snappy compression with 0.8.2 log message format

2017-05-14 Thread Nick Travers (JIRA)
Nick Travers created KAFKA-5236:
---

 Summary: Regression in on-disk log size when using Snappy 
compression with 0.8.2 log message format
 Key: KAFKA-5236
 URL: https://issues.apache.org/jira/browse/KAFKA-5236
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 0.10.2.1
Reporter: Nick Travers


We recently upgraded our brokers in our production environments from 0.10.1.1 
to 0.10.2.1 and we've noticed a sizable regression in the on-disk .log file 
size. For some deployments the increase was as much as 50%.

We run our brokers with the 0.8.2 log message format version. The majority of 
our message volume comes from 0.10.x Java clients sending messages encoded with 
the Snappy codec.

Some initial testing only shows a regression between the two versions when 
using Snappy compression with a log message format of 0.8.2.

I also tested 0.10.x log message formats as well as Gzip compression. The log 
sizes do not differ in this case, so the issue seems confined to 0.8.2 message 
format and Snappy compression.

A git-bisect led me to this commit, which modified the server-side 
implementation of `Record`:
https://github.com/apache/kafka/commit/67f1e5b91bf073151ff57d5d656693e385726697

Here's the PR, which has more context:
https://github.com/apache/kafka/pull/2140

Here is a link to the test I used to reproduce this issue:
https://github.com/nicktrav/kafka/commit/68e8db4fa525e173651ac740edb270b0d90b8818

cc: [~hachikuji] [~junrao] [~ijuma] [~guozhang] (tagged on original PR)
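
For anyone wanting to reproduce independently, the rough shape of the test (not the
linked commit itself; the broker address, topic and payload size here are arbitrary)
is to produce a fixed volume of Snappy-compressed messages against a broker running
each version with {{log.message.format.version=0.8.2}}, then compare the on-disk size
of the topic's log directory:

{code}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SnappyLogSizeRepro {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        byte[] value = new byte[1024]; // arbitrary payload
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100_000; i++) {
                producer.send(new ProducerRecord<>("size-test", value)); // placeholder topic
            }
            producer.flush();
        }
        // Then compare the size of the topic's log directory (e.g. with `du -sh`)
        // between a 0.10.1.1 and a 0.10.2.1 broker, both configured with
        // log.message.format.version=0.8.2.
    }
}
{code}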



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5236) Regression in on-disk log size when using Snappy compression with 0.8.2 log message format

2017-06-02 Thread Nick Travers (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035404#comment-16035404
 ] 

Nick Travers commented on KAFKA-5236:
-

Thanks for the patch [~ijuma]! I ran the same tests I used originally and 
confirmed that there was no regression in the on-disk size.

> Regression in on-disk log size when using Snappy compression with 0.8.2 log 
> message format
> --
>
> Key: KAFKA-5236
> URL: https://issues.apache.org/jira/browse/KAFKA-5236
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.2.1
>Reporter: Nick Travers
>Assignee: Ismael Juma
>  Labels: regression
> Fix For: 0.11.0.0
>
>
> We recently upgraded our brokers in our production environments from 0.10.1.1 
> to 0.10.2.1 and we've noticed a sizable regression in the on-disk .log file 
> size. For some deployments the increase was as much as 50%.
> We run our brokers with the 0.8.2 log message format version. The majority of 
> our message volume comes from 0.10.x Java clients sending messages encoded 
> with the Snappy codec.
> Some initial testing only shows a regression between the two versions when 
> using Snappy compression with a log message format of 0.8.2.
> I also tested 0.10.x log message formats as well as Gzip compression. The log 
> sizes do not differ in this case, so the issue seems confined to 0.8.2 
> message format and Snappy compression.
> A git-bisect led me to this commit, which modified the server-side 
> implementation of `Record`:
> https://github.com/apache/kafka/commit/67f1e5b91bf073151ff57d5d656693e385726697
> Here's the PR, which has more context:
> https://github.com/apache/kafka/pull/2140
> Here is a link to the test I used to reproduce this issue:
> https://github.com/nicktrav/kafka/commit/68e8db4fa525e173651ac740edb270b0d90b8818
> cc: [~hachikuji] [~junrao] [~ijuma] [~guozhang] (tagged on original PR)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)