[ https://issues.apache.org/jira/browse/FLINK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jun Qin updated FLINK-19008:
----------------------------
    Description: 
A customer runs a Flink job with the RocksDB state backend. Checkpoints are retained and taken incrementally. The state size is several TB. When they restore + downscale from a retained checkpoint, downloading the checkpoint files takes only ~20 min, but the job throughput returns to the expected level only after 3 hours.

I do not have RocksDB logs. The suspicion is that those 3 hours are spent on heavy RocksDB compaction and/or flushes, as it was observed that checkpoints could not finish fast enough due to a long {{checkpoint duration (sync)}}. How can we make this restore phase shorter?

For compaction, I think it is worth checking the improvement brought by:
{code:c}
CompactionPri compaction_pri = kMinOverlappingRatio;
{code}
which has become the default in RocksDB 6.x:
{code:c}
// In Level-based compaction, it Determines which file from a level to be
// picked to merge to the next level. We suggest people try
// kMinOverlappingRatio first when you tune your database.
enum CompactionPri : char {
  // Slightly prioritize larger files by size compensated by #deletes
  kByCompensatedSize = 0x0,
  // First compact files whose data's latest update time is oldest.
  // Try this if you only update some hot keys in small ranges.
  kOldestLargestSeqFirst = 0x1,
  // First compact files whose range hasn't been compacted to the next level
  // for the longest. If your updates are random across the key space,
  // write amplification is slightly better with this option.
  kOldestSmallestSeqFirst = 0x2,
  // First compact files whose ratio between overlapping size in next level
  // and its size is the smallest. It in many cases can optimize write
  // amplification.
  kMinOverlappingRatio = 0x3,
};
...
// Default: kMinOverlappingRatio
CompactionPri compaction_pri = kMinOverlappingRatio;
{code}
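A minimal sketch of how this could be tried from the Flink side, assuming the Flink 1.10+ {{RocksDBOptionsFactory}} interface and the RocksJava {{CompactionPriority}} enum; the class name {{TunedCompactionOptionsFactory}} and the checkpoint URI below are illustrative placeholders, not something from this report:
{code:java}
import java.util.Collection;

import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.CompactionPriority;
import org.rocksdb.DBOptions;

// Hypothetical factory that only overrides the compaction priority and
// leaves every other option exactly as Flink configured it.
public class TunedCompactionOptionsFactory implements RocksDBOptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions currentOptions,
                                     Collection<AutoCloseable> handlesToClose) {
        return currentOptions;
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions,
                                                   Collection<AutoCloseable> handlesToClose) {
        // Java-side equivalent of compaction_pri = kMinOverlappingRatio.
        return currentOptions.setCompactionPriority(CompactionPriority.MinOverlappingRatio);
    }
}

// Possible usage in the job's main method (checkpoint URI is a placeholder):
// RocksDBStateBackend backend =
//     new RocksDBStateBackend("hdfs:///flink/checkpoints", true); // incremental checkpoints
// backend.setRocksDBOptions(new TunedCompactionOptionsFactory());
// env.setStateBackend(backend);
{code}
Whether this actually shortens the post-restore phase would still need to be verified against RocksDB logs/metrics from an affected job.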
> Flink Job runs slow after restore + downscale from an incremental checkpoint (rocksdb)
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-19008
>                 URL: https://issues.apache.org/jira/browse/FLINK-19008
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Jun Qin
>            Priority: Major
>


--
This message was sent by Atlassian Jira
(v8.3.4#803005)