[ https://issues.apache.org/jira/browse/KAFKA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221596#comment-16221596 ]
ASF GitHub Bot commented on KAFKA-6134:
---------------------------------------

GitHub user hachikuji opened a pull request:

    https://github.com/apache/kafka/pull/4141

    KAFKA-6134: Read partition reassignment lazily on event handling

    This patch prevents an O(n^2) increase in memory utilization during
    partition reassignment. Instead of storing the reassigned partitions
    in the `PartitionReassignment` object (which is added after every
    partition reassignment), we read the data fresh from ZK when
    processing the event.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hachikuji/kafka KAFKA-6134

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/4141.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4141

----
commit 5131bb19f6fe7fc1939035c48ead052a0ac967a4
Author: Jason Gustafson <ja...@confluent.io>
Date:   2017-10-27T02:01:05Z

    KAFKA-6134: Read partition reassignment lazily on event handling

----

> High memory usage on controller during partition reassignment
> -------------------------------------------------------------
>
>                 Key: KAFKA-6134
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6134
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.11.0.0, 0.11.0.1
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Critical
>              Labels: regression
>             Fix For: 1.0.0, 0.11.0.2
>
>         Attachments: Screen Shot 2017-10-26 at 3.05.40 PM.png
>
>
> We've had a couple of users reporting spikes in memory usage when the
> controller is performing partition reassignment in 0.11. After
> investigation, we found that the controller event queue was using most of
> the retained memory. In particular, we found several thousand
> {{PartitionReassignment}} objects, each one containing one fewer partition
> than the previous one (see the attached image).
> From the code, it seems clear why this is happening. We have a watch on the
> partition reassignment path which adds the {{PartitionReassignment}} object
> to the event queue:
> {code}
> override def handleDataChange(dataPath: String, data: Any): Unit = {
>   val partitionReassignment = ZkUtils.parsePartitionReassignmentData(data.toString)
>   eventManager.put(controller.PartitionReassignment(partitionReassignment))
> }
> {code}
> In the {{PartitionReassignment}} event handler, we iterate through all of the
> partitions in the reassignment. After we complete reassignment for each
> partition, we remove that partition and update the node in ZooKeeper.
> {code}
> // remove this partition from that list
> val updatedPartitionsBeingReassigned = partitionsBeingReassigned - topicAndPartition
> // write the new list to zookeeper
> zkUtils.updatePartitionReassignmentData(updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
> {code}
> This write triggers the handler above, which adds a new event to the queue.
> Each queued event holds its own copy of the remaining partitions, so
> retained memory grows as O(n^2), where n is the number of partitions.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
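
The growth described in the issue can be illustrated with a small standalone model. This is only a sketch under simplifying assumptions, not the controller code: `ReassignmentQueueSketch`, `EagerReassignment`, `LazyReassignment` and `retainedEntries` are hypothetical names, an in-memory set stands in for the reassignment znode, and every completed partition is assumed to fire the watch exactly once while the earlier events are still queued. The eager event mirrors the behaviour reported above (each event carries a snapshot), and the lazy event mirrors the idea in the pull request (the event carries no data and the handler would re-read the znode when it runs).

```scala
import scala.collection.mutable

object ReassignmentQueueSketch {
  // Hypothetical stand-ins for controller events; these are not Kafka's classes.
  sealed trait Event
  final case class EagerReassignment(partitions: Set[Int]) extends Event // holds a snapshot
  case object LazyReassignment extends Event                             // holds no data

  /** Counts the partition entries held by queued events after n partitions complete. */
  def retainedEntries(n: Int, eager: Boolean): Long = {
    // In-memory stand-in for the contents of the reassignment znode.
    var znode = (0 until n).toSet

    // What the watch would enqueue each time the znode changes.
    def watchFired(): Event =
      if (eager) EagerReassignment(znode) else LazyReassignment

    val queue = mutable.Queue[Event](watchFired())

    // Completing a partition rewrites the znode, which fires the watch again.
    for (p <- 0 until n) {
      znode -= p
      queue.enqueue(watchFired())
    }

    // Sum the partition entries still referenced from the queue.
    queue.map {
      case EagerReassignment(ps) => ps.size.toLong
      case LazyReassignment      => 0L
    }.sum
  }
}
```

In this model, `retainedEntries(n, eager = true)` evaluates to n(n+1)/2, which is the quadratic growth, while `retainedEntries(n, eager = false)` stays at zero because the queued events carry no partition data.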