[ https://issues.apache.org/jira/browse/KAFKA-9393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lucas Bradstreet updated KAFKA-9393:
------------------------------------
    Issue Type: Bug  (was: Improvement)
       Summary: DeleteRecords may cause extreme lock contention for large partition directories  (was: DeleteRecords triggers extreme lock contention for large partition directories)

> DeleteRecords may cause extreme lock contention for large partition
> directories
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-9393
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9393
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.2.0, 2.3.0, 2.4.0
>            Reporter: Lucas Bradstreet
>            Priority: Major
>
> DeleteRecords, frequently used by KStreams, triggers a
> Log.maybeIncrementLogStartOffset call, which calls
> kafka.log.ProducerStateManager.listSnapshotFiles, which in turn calls
> java.io.File.listFiles on the partition directory. Listing this directory
> can take an extreme amount of time for partitions with many small segments
> (e.g. 20000), often multiple seconds. This causes lock contention for the
> log, and if produce requests are also arriving for the same log, a majority
> of request handler threads can become blocked waiting for the DeleteRecords
> call to finish.
> I believe this problem goes back to the initial implementation of the
> transactional producer, but I need to confirm how far back it goes.
> One possible solution is to maintain a producer state snapshot aligned to
> each log segment, and simply delete it whenever we delete that segment. This
> would ensure that we never have to perform a directory scan.
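The proposed fix (one producer state snapshot per log segment, deleted together with that segment) can be sketched in Java. This is a minimal illustration under stated assumptions, not Kafka's code: the class `SegmentAlignedSnapshots`, its methods, and the empty placeholder snapshot contents are hypothetical and are not the `ProducerStateManager` API, and the 20-digit zero-padded file name is an assumption modeled on Kafka's segment file naming. The idea it shows is that the set of snapshot files lives in an in-memory sorted map, so removing a segment's snapshot costs one map removal plus one file delete, instead of a `java.io.File.listFiles` scan whose cost grows with the number of files in the partition directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch (NOT Kafka's ProducerStateManager API): snapshot files
// are tracked in memory, one per log segment, so deleting a segment deletes
// its snapshot without listing the partition directory.
final class SegmentAlignedSnapshots {
    private final Path partitionDir;
    // Segment base offset -> snapshot file, kept sorted by offset.
    private final ConcurrentSkipListMap<Long, Path> snapshots = new ConcurrentSkipListMap<>();

    SegmentAlignedSnapshots(Path partitionDir) {
        this.partitionDir = partitionDir;
    }

    // Assumed naming scheme: 20-digit zero-padded offset + ".snapshot".
    static String fileName(long offset) {
        return String.format("%020d.snapshot", offset);
    }

    // Writes a snapshot aligned to a newly rolled segment and records it in memory.
    Path onSegmentRolled(long segmentBaseOffset) {
        Path file = partitionDir.resolve(fileName(segmentBaseOffset));
        try {
            Files.write(file, new byte[0]); // placeholder: real producer state omitted
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        snapshots.put(segmentBaseOffset, file);
        return file;
    }

    // Deletes the snapshot aligned to a deleted segment: a map lookup, no directory scan.
    void onSegmentDeleted(long segmentBaseOffset) {
        Path file = snapshots.remove(segmentBaseOffset);
        if (file == null) {
            return; // no snapshot recorded for this segment; nothing to do
        }
        try {
            Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int size() {
        return snapshots.size();
    }
}
```

A caller would invoke `onSegmentRolled` when a segment is created and `onSegmentDeleted` from the same code path that deletes the segment (for example, after DeleteRecords advances the log start offset past it). The sorted `ConcurrentSkipListMap` is one reasonable choice here because it keeps snapshots ordered by offset and tolerates concurrent readers; any structure that avoids re-listing the directory would serve the same purpose.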