David Jacot created KAFKA-10002:
-----------------------------------

             Summary: Improve performances of StopReplicaRequest with large 
number of partitions to be deleted
                 Key: KAFKA-10002
                 URL: https://issues.apache.org/jira/browse/KAFKA-10002
             Project: Kafka
          Issue Type: Improvement
            Reporter: David Jacot
            Assignee: David Jacot


I have noticed that StopReplicaRequests with partitions to be deleted are 
extremely slow when there are more than 2000 partitions, which leads to 
hitting the request timeout in the controller. A request with 2000 partitions 
to be deleted still works, but performance degrades significantly as the 
number increases. For example, a request with 3000 partitions to be deleted 
takes approx. 60 seconds to be processed.

A CPU profile shows that most of the time, almost 90%, is spent checkpointing 
log start offsets and recovery offsets. See attached. 
When a partition is deleted, the replica manager calls 
`ReplicaManager#asyncDelete`, which checkpoints recovery offsets and log start 
offsets. As the checkpoints are per data directory, the checkpointing is done 
for all the partitions in the directory of the partition to be deleted. In our 
case, where we have only one data directory, if you delete 1000 partitions, we 
end up checkpointing the same things 1000 times, which is not efficient.
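
To illustrate the cost described above, here is a minimal sketch in Python 
(not Kafka code). It only counts checkpoint writes and the number of entries 
written. The class and function names (`CheckpointCounter`, 
`delete_one_by_one`, `delete_batched`) and the `/data` directory are 
hypothetical, and the batched variant is one possible approach, assumed here 
for comparison only.

```python
class CheckpointCounter:
    """Hypothetical stand-in for a per-directory checkpoint file.

    It does no I/O; it only counts how often a checkpoint is written
    and how many partition entries each write would contain.
    """

    def __init__(self):
        self.writes = 0
        self.entries_written = 0

    def checkpoint(self, partitions_in_dir):
        self.writes += 1
        self.entries_written += len(partitions_in_dir)


def delete_one_by_one(dirs, to_delete, counter):
    """Checkpoint the whole directory after every single deletion."""
    for directory, partition in to_delete:
        dirs[directory].discard(partition)
        counter.checkpoint(dirs[directory])


def delete_batched(dirs, to_delete, counter):
    """Remove all partitions first, then checkpoint each directory once."""
    affected = set()
    for directory, partition in to_delete:
        dirs[directory].discard(partition)
        affected.add(directory)
    for directory in affected:
        counter.checkpoint(dirs[directory])


def make_dirs(num_partitions):
    # Assumption: a single data directory, as in the case described above.
    return {"/data": set(range(num_partitions))}


# Delete 1000 of 3000 partitions, all in the same directory.
to_delete = [("/data", p) for p in range(1000)]

naive = CheckpointCounter()
delete_one_by_one(make_dirs(3000), to_delete, naive)

batched = CheckpointCounter()
delete_batched(make_dirs(3000), to_delete, batched)

print(naive.writes, naive.entries_written)      # 1000 2499500
print(batched.writes, batched.entries_written)  # 1 2000
```

In this toy model, checkpointing once per deletion writes about 2.5 million 
entries for 1000 deletions, while checkpointing once per directory writes 
2000. The work grows roughly with (deleted partitions) x (partitions in the 
directory), which is consistent with the degradation observed above.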



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
