[ https://issues.apache.org/jira/browse/FLINK-34050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814312#comment-17814312 ]

Stefan Richter edited comment on FLINK-34050 at 2/5/24 10:41 AM:
-----------------------------------------------------------------

Just one idea: since the current proposal makes rescaling times worse, it can have a significant drawback. How about we call deleteFiles asynchronously before the next checkpoint after a rescaling, thus making sure that the space amplification never makes it into the checkpoint while keeping it off the critical path for restoring. Wdyt?

was (Author: srichter):
Just one idea: since the current proposal makes rescaling times worse, it can have a significant drawback. How about we call deleteFiles in the async part of the next checkpoint after a rescaling, thus making sure that the space amplification never makes it into the checkpoint while keeping it off the critical path for restoring or processing. Wdyt?

> Rocksdb state has space amplification after rescaling with DeleteRange
> ----------------------------------------------------------------------
>
>                 Key: FLINK-34050
>                 URL: https://issues.apache.org/jira/browse/FLINK-34050
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>            Reporter: Jinzhong Li
>            Assignee: Jinzhong Li
>            Priority: Major
>         Attachments: image-2024-01-10-21-23-48-134.png, image-2024-01-10-21-24-10-983.png, image-2024-01-10-21-28-24-312.png
>
> FLINK-21321 uses deleteRange to speed up RocksDB rescaling; however, it can cause space amplification in some cases.
> We can reproduce the problem with a WordCount job:
> 1) Before rescaling, the stateful operator in the WordCount job runs with parallelism 2 and has a 4G+ full checkpoint size;
> !image-2024-01-10-21-24-10-983.png|width=266,height=130!
> 2) After restarting the job with parallelism 4 (for the stateful operator), the full checkpoint size of the new job grows to 8G+;
> 3) Even after many successful checkpoints, the full checkpoint size remains 8G+;
> !image-2024-01-10-21-28-24-312.png|width=454,height=111!
>
> The root cause is that the deleted keyGroupRange does not overlap with the current DB keyGroupRange, so new data written into RocksDB after rescaling almost never takes part in an LSM compaction together with the deleted data (which belongs to other keyGroupRanges). As a result, the SST files holding the deleted data are never picked for compaction and never physically removed.
>
> This space amplification can hurt RocksDB read performance and disk space usage after rescaling. It looks like a regression introduced by the deleteRange rescaling optimization.
>
> To solve this problem, I think we could invoke RocksDB.deleteFilesInRanges after deleteRange:
> {code:java}
> public static void clipDBWithKeyGroupRange() {
>     //.......
>     List<byte[]> ranges = new ArrayList<>();
>     //.......
>     // Logical deletion: write a range tombstone covering the out-of-range key-groups.
>     deleteRange(db, columnFamilyHandles, beginKeyGroupBytes, endKeyGroupBytes);
>     ranges.add(beginKeyGroupBytes);
>     ranges.add(endKeyGroupBytes);
>     //....
>     // Physical deletion: drop SST files that lie entirely inside the deleted ranges.
>     for (ColumnFamilyHandle columnFamilyHandle : columnFamilyHandles) {
>         db.deleteFilesInRanges(columnFamilyHandle, ranges, false);
>     }
> }
> {code}
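For illustration, here is a minimal, self-contained sketch of the proposed deleteRange + deleteFilesInRanges sequence against the plain org.rocksdb Java API (not Flink code; the class name, key layout, and file-size tuning are assumptions made for the demo). It flushes and fully compacts first so the clipped key-groups end up in their own SST files, then shows that deleteRange alone leaves the SST size unchanged while deleteFilesInRanges reclaims it:

{code:java}
import java.nio.file.Files;
import java.util.Arrays;

import org.rocksdb.FlushOptions;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class DeleteFilesInRangesDemo {

    public static void main(String[] args) throws Exception {
        RocksDB.loadLibrary();
        String path = Files.createTempDirectory("delete-files-demo").toString();
        try (Options options = new Options()
                        .setCreateIfMissing(true)
                        // Small target file size so compaction emits many SST files,
                        // most of which fall entirely inside a single key-group.
                        .setTargetFileSizeBase(1024 * 1024)
                        // Keep compaction manual so the size numbers are deterministic.
                        .setDisableAutoCompactions(true);
                RocksDB db = RocksDB.open(options, path);
                FlushOptions flushOptions = new FlushOptions().setWaitForFlush(true)) {

            // Write 4 "key-groups", each key prefixed with a 2-byte big-endian group id,
            // roughly mimicking the key layout Flink uses for keyed state.
            for (int group = 0; group < 4; group++) {
                for (int i = 0; i < 50_000; i++) {
                    db.put(key(group, i), new byte[128]);
                }
            }
            db.flush(flushOptions);
            db.compactRange(); // push everything out of L0 into leveled SST files

            byte[] begin = {0, 2}; // key-groups [2, 4) no longer belong to this instance
            byte[] end = {0, 4};

            // Step 1: logical deletion via a range tombstone. Fast, but the SST files stay.
            db.deleteRange(begin, end);
            printSstSize(db, "after deleteRange");

            // Step 2: physically drop SST files entirely contained in the deleted range.
            // Files straddling a range boundary survive until a later compaction.
            db.deleteFilesInRanges(db.getDefaultColumnFamily(), Arrays.asList(begin, end), false);
            printSstSize(db, "after deleteFilesInRanges");
        }
    }

    // 2-byte big-endian group id followed by a 4-byte big-endian counter.
    private static byte[] key(int group, int i) {
        return new byte[] {
            (byte) (group >>> 8), (byte) group,
            (byte) (i >>> 24), (byte) (i >>> 16), (byte) (i >>> 8), (byte) i
        };
    }

    private static void printSstSize(RocksDB db, String label) throws RocksDBException {
        System.out.println(label + ": " + db.getProperty("rocksdb.total-sst-files-size") + " bytes");
    }
}
{code}

Note that deleteFilesInRanges only drops files entirely contained in a range, so it is an optimization on top of deleteRange rather than a replacement: the tombstone is still needed to hide the clipped keys that remain in surviving boundary files.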
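And one possible shape for the "call deleteFiles async" idea from the comment at the top, as a rough sketch (the class, method, and executor wiring are hypothetical, not actual Flink code): kick the physical deletion off on a background executor right after restore, and have the next checkpoint wait on the future so the dropped files never make it into the snapshot.

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

final class AsyncClipCleanup {

    /**
     * Schedules the physical file deletion on a background executor right after
     * restore, keeping it off the critical path of restoring and processing.
     */
    static CompletableFuture<Void> scheduleFileCleanup(
            RocksDB db,
            List<ColumnFamilyHandle> columnFamilyHandles,
            byte[] beginKeyGroupBytes,
            byte[] endKeyGroupBytes,
            ExecutorService ioExecutor) {

        List<byte[]> ranges = Arrays.asList(beginKeyGroupBytes, endKeyGroupBytes);
        return CompletableFuture.runAsync(
                () -> {
                    try {
                        for (ColumnFamilyHandle handle : columnFamilyHandles) {
                            // The range was already logically deleted by deleteRange,
                            // so concurrent readers cannot observe resurrected data.
                            db.deleteFilesInRanges(handle, ranges, false);
                        }
                    } catch (RocksDBException e) {
                        throw new RuntimeException("Background file cleanup failed", e);
                    }
                },
                ioExecutor);
    }
}
{code}

The snapshot path would then join this future (e.g. cleanupFuture.join()) before, or as part of the async phase of, the next checkpoint, so the space amplification never reaches the checkpoint.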