[ https://issues.apache.org/jira/browse/KAFKA-9393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lucas Bradstreet updated KAFKA-9393:
------------------------------------
    Issue Type: Bug  (was: Improvement)
       Summary: DeleteRecords may cause extreme lock contention for large 
partition directories  (was: DeleteRecords triggers extreme lock contention for 
large partition directories)

> DeleteRecords may cause extreme lock contention for large partition 
> directories
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-9393
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9393
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.2.0, 2.3.0, 2.4.0
>            Reporter: Lucas Bradstreet
>            Priority: Major
>
> DeleteRecords, frequently used by KStreams, triggers a 
> Log.maybeIncrementLogStartOffset call, which calls 
> kafka.log.ProducerStateManager.listSnapshotFiles, which in turn calls 
> java.io.File.listFiles on the partition directory. Listing this directory 
> can take multiple seconds for partitions with many small segments (e.g. 
> 20000). This causes lock contention on the log: if produce requests are 
> also arriving for the same log, a majority of request handler threads can 
> become blocked waiting for the DeleteRecords call to finish.
> I believe this is a problem going back to the initial implementation of the 
> transactional producer, but I need to confirm how far back it goes.
> One possible solution is to maintain a producer state snapshot aligned with 
> each log segment and delete it whenever that segment is deleted. This would 
> ensure that we never have to perform a directory scan.
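The segment-aligned approach proposed above can be sketched roughly as follows. This is a minimal, hypothetical Java sketch, not Kafka's actual ProducerStateManager code: the class name SegmentAlignedSnapshots, its methods, and the 20-digit zero-padded "<offset>.snapshot" file naming are assumptions for illustration. The partition directory is listed once when the index is loaded; after that, deleting snapshots below a new log start offset only touches an in-memory sorted index and the files actually being removed.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch, not Kafka's ProducerStateManager: snapshot files are
// tracked in a sorted in-memory index so deleting old snapshots never needs
// to list the partition directory.
class SegmentAlignedSnapshots {
    private static final String SUFFIX = ".snapshot";
    private final File dir;
    // Snapshot offset -> snapshot file, kept in ascending offset order.
    private final ConcurrentSkipListMap<Long, File> index = new ConcurrentSkipListMap<>();

    SegmentAlignedSnapshots(File dir) {
        this.dir = dir;
        // The full directory listing (the slow operation described in the issue)
        // happens once here, at load time, instead of on every DeleteRecords call.
        File[] files = dir.listFiles();
        if (files != null) {
            for (File f : files) {
                String name = f.getName();
                if (f.isFile() && name.endsWith(SUFFIX)) {
                    long offset = Long.parseLong(name.substring(0, name.length() - SUFFIX.length()));
                    index.put(offset, f);
                }
            }
        }
    }

    // Assumed naming: 20-digit zero-padded offset followed by ".snapshot".
    private File fileFor(long offset) {
        return new File(dir, String.format("%020d%s", offset, SUFFIX));
    }

    // Called when a new segment is rolled at baseOffset, so every segment has a
    // matching snapshot. A real implementation would serialize producer state here.
    void writeSnapshot(long baseOffset) {
        File f = fileFor(baseOffset);
        try {
            f.createNewFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        index.put(baseOffset, f);
    }

    // Deletes snapshots with offset < newLogStartOffset, e.g. after segments are
    // removed by DeleteRecords. The work scales with the number of snapshots
    // removed, not with the number of files in the directory.
    int deleteSnapshotsBefore(long newLogStartOffset) {
        NavigableMap<Long, File> stale = index.headMap(newLogStartOffset, false);
        int deleted = 0;
        for (File f : stale.values()) {
            if (f.delete()) {
                deleted++;
            }
        }
        stale.clear(); // the headMap view is backed by the index, so this removes the entries
        return deleted;
    }

    List<Long> snapshotOffsets() {
        return new ArrayList<>(index.keySet());
    }
}
```

A sorted map makes "delete everything below the new log start offset" a cheap range operation. The trade-off in this sketch is that the in-memory index must stay consistent with the files on disk; rebuilding it with a single scan at load time (for example during log recovery) is one way to handle restarts.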



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
