[
https://issues.apache.org/jira/browse/KAFKA-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jun Rao resolved KAFKA-10002.
-----------------------------
Fix Version/s: 2.7.0
Resolution: Fixed
Merged the PR to trunk.
> Improve performances of StopReplicaRequest with large number of partitions to
> be deleted
> ----------------------------------------------------------------------------------------
>
> Key: KAFKA-10002
> URL: https://issues.apache.org/jira/browse/KAFKA-10002
> Project: Kafka
> Issue Type: Improvement
> Reporter: David Jacot
> Assignee: David Jacot
> Priority: Major
> Fix For: 2.7.0
>
>
> I have noticed that StopReplicaRequests with partitions to be deleted are
> extremely slow when there are more than 2000 partitions, which leads to
> hitting the request timeout in the controller. A request with 2000 partitions
> to be deleted still works, but performance degrades significantly as the
> number increases. For example, a request with 3000 partitions to be deleted
> takes approx. 60 seconds to be processed.
> A CPU profile shows that most of the time is spent checkpointing log start
> offsets and recovery offsets. Almost 90% of the time is spent there. See
> attached.
> When a partition is deleted, the replica manager calls
> `ReplicaManager#asyncDelete`, which checkpoints recovery offsets and log start
> offsets. As the checkpoints are per data directory, the checkpointing is done
> for all the partitions in the directory of the partition to be deleted. In
> our case, where we have only one data directory, if we delete 1000
> partitions, we end up checkpointing the same things 1000 times, which is not
> efficient.
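
To make the cost described above concrete, here is a minimal Java sketch that counts checkpoint writes under two strategies: checkpointing once per deleted partition, and checkpointing once per affected data directory after the whole batch has been handled. The class and method names are hypothetical and do not correspond to Kafka's actual code, and the per-directory variant is only one possible way to avoid the repeated work, not necessarily what the merged PR does.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model: each list element is the data directory of one
// partition being deleted. Nothing here is a real Kafka API.
class CheckpointBatchingSketch {

    // Strategy described in the issue: every deletion triggers a checkpoint
    // of that partition's data directory, even if it was just checkpointed.
    static int checkpointPerPartition(List<String> dirsOfDeletedPartitions) {
        int writes = 0;
        for (String dir : dirsOfDeletedPartitions) {
            writes++; // stands in for rewriting the checkpoint files of dir
        }
        return writes;
    }

    // Assumed alternative: collect the distinct directories touched by the
    // batch first, then checkpoint each of them exactly once.
    static int checkpointPerDirectory(List<String> dirsOfDeletedPartitions) {
        Set<String> affectedDirs = new LinkedHashSet<>(dirsOfDeletedPartitions);
        int writes = 0;
        for (String dir : affectedDirs) {
            writes++; // one checkpoint per affected directory
        }
        return writes;
    }

    public static void main(String[] args) {
        // 1000 partitions deleted, all in a single (made-up) data directory.
        List<String> dirs = Collections.nCopies(1000, "/data/kafka-logs");
        System.out.println(checkpointPerPartition(dirs)); // prints 1000
        System.out.println(checkpointPerDirectory(dirs)); // prints 1
    }
}
```

With a single data directory, the number of checkpoint writes in this model drops from the number of deleted partitions to one per request.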
--
This message was sent by Atlassian Jira
(v8.3.4#803005)