[ https://issues.apache.org/jira/browse/FLINK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jun Qin updated FLINK-19008:
----------------------------
    Description: 
A customer runs a Flink job with the RocksDB state backend. Checkpoints are retained and taken incrementally. The state size is several TB. When they restore + downscale from a retained checkpoint, downloading the checkpoint files takes only ~20 min, but the job throughput returns to the expected level only after 3 hours.

I do not have RocksDB logs. The suspicion is that those 3 hours are spent on heavy RocksDB compaction and/or flushes, as it was observed that checkpoints could not finish fast enough due to a long {{checkpoint duration (sync)}}. How can we make this restore phase shorter?

For compaction, I think it is worth checking the improvement brought by:
{code:c}
CompactionPri compaction_pri = kMinOverlappingRatio;
{code}
which has become the default in RocksDB 6.x:
{code:c}
// In Level-based compaction, it Determines which file from a level to be
// picked to merge to the next level. We suggest people try
// kMinOverlappingRatio first when you tune your database.
enum CompactionPri : char {
  // Slightly prioritize larger files by size compensated by #deletes
  kByCompensatedSize = 0x0,
  // First compact files whose data's latest update time is oldest.
  // Try this if you only update some hot keys in small ranges.
  kOldestLargestSeqFirst = 0x1,
  // First compact files whose range hasn't been compacted to the next level
  // for the longest. If your updates are random across the key space,
  // write amplification is slightly better with this option.
  kOldestSmallestSeqFirst = 0x2,
  // First compact files whose ratio between overlapping size in next level
  // and its size is the smallest. It in many cases can optimize write
  // amplification.
  kMinOverlappingRatio = 0x3,
};
...
// Default: kMinOverlappingRatio
CompactionPri compaction_pri = kMinOverlappingRatio;
{code}
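A minimal sketch of how this could be tried from the Flink side, assuming the Flink 1.10+ {{RocksDBOptionsFactory}} interface and the RocksJava {{CompactionPriority}} enum; the class name {{TunedCompactionOptionsFactory}} and the checkpoint URI below are illustrative placeholders, not something from this report:
{code:java}
import java.util.Collection;

import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.CompactionPriority;
import org.rocksdb.DBOptions;

// Hypothetical factory that only overrides the compaction priority and
// leaves every other option exactly as Flink configured it.
public class TunedCompactionOptionsFactory implements RocksDBOptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions currentOptions,
                                     Collection<AutoCloseable> handlesToClose) {
        return currentOptions;
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions,
                                                   Collection<AutoCloseable> handlesToClose) {
        // Java-side equivalent of compaction_pri = kMinOverlappingRatio.
        return currentOptions.setCompactionPriority(CompactionPriority.MinOverlappingRatio);
    }
}

// Possible usage in the job's main method (checkpoint URI is a placeholder):
// RocksDBStateBackend backend =
//     new RocksDBStateBackend("hdfs:///flink/checkpoints", true); // incremental checkpoints
// backend.setRocksDBOptions(new TunedCompactionOptionsFactory());
// env.setStateBackend(backend);
{code}
Whether this actually shortens the post-restore phase would still need to be verified against RocksDB logs/metrics from an affected job.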
> Flink Job runs slow after restore + downscale from an incremental checkpoint (rocksdb)
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-19008
>                 URL: https://issues.apache.org/jira/browse/FLINK-19008
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Jun Qin
>            Priority: Major
>


--
This message was sent by Atlassian Jira
(v8.3.4#803005)