[ https://issues.apache.org/jira/browse/FLINK-34050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819257#comment-17819257 ]
Jinzhong Li commented on FLINK-34050:
-------------------------------------

[~srichter] Thanks for your comments. I think:

1. The current solution uses the deleteFilesInRanges API to bulk-remove useless files. This process involves no file data reads or writes, so it is expected to be very fast, with no significant impact on rescaling time. (I will validate this with the rescaling benchmark.)

2. In large-state scenarios, the space amplification described in this issue may leave no space on the local disk during rescaling, causing the rescale to fail. Clearly, async deletion cannot solve that problem. In addition, since async deletion is more complex to implement, I think the current proposal (sync deletion) is the better way to solve this problem?

> Rocksdb state has space amplification after rescaling with DeleteRange
> ----------------------------------------------------------------------
>
>                 Key: FLINK-34050
>                 URL: https://issues.apache.org/jira/browse/FLINK-34050
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>            Reporter: Jinzhong Li
>            Assignee: Jinzhong Li
>            Priority: Major
>         Attachments: image-2024-01-10-21-23-48-134.png, image-2024-01-10-21-24-10-983.png, image-2024-01-10-21-28-24-312.png
>
> FLINK-21321 uses deleteRange to speed up RocksDB rescaling; however, it can cause space amplification in some cases.
> We can reproduce this problem with a WordCount job:
> 1) Before rescaling, the stateful operator in the WordCount job has parallelism 2 and a 4G+ full checkpoint size;
> !image-2024-01-10-21-24-10-983.png|width=266,height=130!
> 2) Then we restart the job with parallelism 4 (for the stateful operator); the full checkpoint size of the new job is 8G+;
> 3) After many successful checkpoints, the full checkpoint size is still 8G+.
> !image-2024-01-10-21-28-24-312.png|width=454,height=111!
>
> The root cause of this issue is that the deleted keyGroupRange does not overlap with the current DB keyGroupRange, so new data written into RocksDB after rescaling almost never goes through LSM compaction together with the deleted data (which belongs to other keyGroupRanges).
>
> This space amplification may degrade RocksDB read performance and increase disk space usage after rescaling. It looks like a regression introduced by the deleteRange rescaling optimization.
>
> To solve this problem, I think maybe we can invoke RocksDB.deleteFilesInRanges after deleteRange:
> {code:java}
> public static void clipDBWithKeyGroupRange() {
>     //.......
>     List<byte[]> ranges = new ArrayList<>();
>     //.......
>     deleteRange(db, columnFamilyHandles, beginKeyGroupBytes, endKeyGroupBytes);
>     ranges.add(beginKeyGroupBytes);
>     ranges.add(endKeyGroupBytes);
>     //....
>     for (ColumnFamilyHandle columnFamilyHandle : columnFamilyHandles) {
>         // Drop SST files fully contained in the deleted range (metadata-only).
>         db.deleteFilesInRanges(columnFamilyHandle, ranges, false);
>     }
> }
> {code}
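For reference, a minimal standalone sketch of the proposed two-step clipping against the plain RocksDB Java API; the database path and key-group bounds below are made-up illustration values, not Flink's actual ones:

{code:java}
import java.util.Arrays;
import java.util.List;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class ClipKeyGroupRangeSketch {
    static {
        RocksDB.loadLibrary();
    }

    public static void main(String[] args) throws RocksDBException {
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/clip-range-demo")) {
            // Hypothetical serialized key-group bounds; Flink derives the
            // real ones from the operator's KeyGroupRange.
            byte[] begin = new byte[] {0x00};
            byte[] end = new byte[] {0x10};

            // Step 1: tombstone the range. Cheap, but the covered SST data
            // stays on disk until compaction eventually touches it.
            db.deleteRange(db.getDefaultColumnFamily(), begin, end);

            // Step 2: drop SST files that lie entirely inside [begin, end).
            // This is a metadata-only operation (no key reads or writes),
            // which is why it should add little to the rescaling time.
            List<byte[]> ranges = Arrays.asList(begin, end);
            db.deleteFilesInRanges(db.getDefaultColumnFamily(), ranges, false);
        }
    }
}
{code}

Note that deleteFilesInRanges keeps files that only partially overlap the range; their dead data is reclaimed later by compaction, so the call bounds, rather than fully eliminates, the leftover space.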