Gardner Vickers created KAFKA-12964:
---------------------------------------

             Summary: Corrupt segment recovery can delete new producer state snapshots
                 Key: KAFKA-12964
                 URL: https://issues.apache.org/jira/browse/KAFKA-12964
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 2.8.0
            Reporter: Gardner Vickers
            Assignee: Gardner Vickers


During log recovery, we may schedule asynchronous deletion of segment files in 
`deleteSegmentFiles`.

[https://github.com/apache/kafka/blob/fc5245d8c37a6c9d585c5792940a8f9501bedbe1/core/src/main/scala/kafka/log/Log.scala#L2382]

If we're truncating the log, the scheduled deletions may target segments whose 
base offsets match those of segments that will be written in the future. To avoid 
asynchronously deleting those future segments, we rename the segment and index 
files before scheduling the deletion, but we do not do this for producer state 
snapshot files.

This leaves us vulnerable to a race condition: when the async deletion runs, it 
could delete snapshot files that belong to segments written after log recovery.

To fix this, we should first remove the `SnapshotFile` from the 
`ProducerStateManager` and rename the file so that it carries the 
`Log.DeletedFileSuffix`. Then we can asynchronously delete the renamed snapshot 
file later.
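
The ordering problem and the proposed rename-first fix can be illustrated with a 
small standalone sketch. This is not the Kafka implementation: the class and 
method names are hypothetical, the `.deleted` value is assumed to match 
`Log.DeletedFileSuffix`, the file naming (20-digit zero-padded offset plus 
`.snapshot`) is assumed, and the "async" step is run inline so the interleaving 
is deterministic.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SnapshotDeletionSketch {
    // Assumption: mirrors the value of Log.DeletedFileSuffix.
    static final String DELETED_SUFFIX = ".deleted";

    // Hypothetical helper: path of the snapshot file for a given offset.
    static Path snapshotPath(Path dir, long offset) {
        return dir.resolve(String.format("%020d.snapshot", offset));
    }

    // Synchronous step: rename the snapshot so a later delete cannot hit a new file.
    static Path markForDeletion(Path snapshot) throws IOException {
        Path renamed = snapshot.resolveSibling(snapshot.getFileName() + DELETED_SUFFIX);
        return Files.move(snapshot, renamed, StandardCopyOption.ATOMIC_MOVE);
    }

    // Without a rename: the delayed delete is keyed on the original path,
    // so it removes whatever file has that name when it finally runs.
    static boolean newSnapshotSurvivesWithoutRename() {
        try {
            Path dir = Files.createTempDirectory("snapshot-sketch");
            Path old = Files.write(snapshotPath(dir, 100L), new byte[] {1});
            // A snapshot with the same offset is written after recovery.
            Path fresh = Files.write(snapshotPath(dir, 100L), new byte[] {2});
            Files.deleteIfExists(old); // the "async" delete runs late
            return Files.exists(fresh);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // With a rename first: the delayed delete only touches the renamed file.
    static boolean newSnapshotSurvivesWithRename() {
        try {
            Path dir = Files.createTempDirectory("snapshot-sketch");
            Path old = Files.write(snapshotPath(dir, 100L), new byte[] {1});
            Path renamed = markForDeletion(old);
            Path fresh = Files.write(snapshotPath(dir, 100L), new byte[] {2});
            Files.deleteIfExists(renamed); // the "async" delete runs late
            return Files.exists(fresh) && !Files.exists(renamed);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without rename, new snapshot survives: "
                + newSnapshotSurvivesWithoutRename());
        System.out.println("with rename, new snapshot survives: "
                + newSnapshotSurvivesWithRename());
    }
}
```

In the sketch, the rename happens synchronously and only the removal of the 
renamed file is deferred. That is the same split the segment and index files 
already get.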



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
