[jira] [Commented] (FLINK-21726) Fix checkpoint stuck

Yun Tang (Jira) Fri, 19 Mar 2021 21:27:08 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-21726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305300#comment-17305300
 ]


Yun Tang commented on FLINK-21726:
----------------------------------

[~trohrmann] Sorry for the late reply; Apache Jira failed to send me any 
notifications again. RocksDB uses LEVEL compaction by default, and streaming 
workloads care more about read latency, whereas UNIVERSAL compaction is better 
for write amplification but much worse for read amplification. Based on our 
knowledge and development experience with an internal LSM-like DB, I think our 
users are unlikely to run into this problem.

[~fanrui] Thanks for your enthusiasm in fixing this problem and contributing 
back to the RocksDB community. We already plan to bump the FRocksDB version to 
the latest one in the next Flink release, together with the byte buffer 
improvement that closes the gap left by the current performance regression, 
and could include your fix then.

> Fix checkpoint stuck
> --------------------
>
>                 Key: FLINK-21726
>                 URL: https://issues.apache.org/jira/browse/FLINK-21726
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>    Affects Versions: 1.11.3, 1.12.2, 1.13.0
>            Reporter: fanrui
>            Priority: Critical
>             Fix For: 1.13.0
>
>
> h1. 1. Bug description:
> When RocksDB takes a checkpoint, it may get stuck in the 
> `WaitUntilFlushWouldNotStallWrites` method.
> h1. 2. Simple analysis of the reasons:
> h2. 2.1 Configuration parameters:
>  
> {code:java}
> # Flink yaml:
> state.backend.rocksdb.predefined-options: SPINNING_DISK_OPTIMIZED_HIGH_MEM
> state.backend.rocksdb.compaction.style: UNIVERSAL
> # corresponding RocksDB config
> Compaction Style : Universal 
> max_write_buffer_number : 4
> min_write_buffer_number_to_merge : 3{code}
> A checkpoint is usually very fast. When a checkpoint is executed, 
> `WaitUntilFlushWouldNotStallWrites` is called. If there are 2 immutable 
> MemTables, which is fewer than `min_write_buffer_number_to_merge`, they will 
> not be flushed, yet execution still reaches the following check.
>  
> {code:java}
> // method: GetWriteStallConditionAndCause
> if (mutable_cf_options.max_write_buffer_number > 3 &&
>     num_unflushed_memtables >=
>         mutable_cf_options.max_write_buffer_number - 1) {
>   return {WriteStallCondition::kDelayed, WriteStallCause::kMemtableLimit};
> }
> {code}
> code link: 
> [https://github.com/facebook/rocksdb/blob/fbed72f03c3d9e4fdca3e5993587ef2559ba6ab9/db/column_family.cc#L847]
> The checkpoint assumes that a FlushJob is pending, but there is none, so it 
> waits forever.
> h2. 2.2 solution:
> Add a restriction: wait only when the `number of immutable MemTables` >= 
> `min_write_buffer_number_to_merge`.
> The RocksDB community has fixed this bug; link: 
> [https://github.com/facebook/rocksdb/pull/7921]
> h2. 2.3 Code that can reproduce the bug:
> [https://github.com/1996fanrui/fanrui-learning/blob/flink-1.12/module-java/src/main/java/com/dream/rocksdb/RocksDBCheckpointStuck.java]
> h1. 3. Interesting point
> This bug is triggered only when `the number of sorted runs >= 
> level0_file_num_compaction_trigger`, because otherwise the following early 
> break in `WaitUntilFlushWouldNotStallWrites` exits the wait loop.
> {code:java}
> if (cfd->imm()->NumNotFlushed() <
>         cfd->ioptions()->min_write_buffer_number_to_merge &&
>     vstorage->l0_delay_trigger_count() <
>         mutable_cf_options.level0_file_num_compaction_trigger) {
>   break;
> }
> {code}
> code link: 
> [https://github.com/facebook/rocksdb/blob/fbed72f03c3d9e4fdca3e5993587ef2559ba6ab9/db/db_impl/db_impl_compaction_flush.cc#L1974]
> With Universal compaction, `l0_delay_trigger_count() >= 
> level0_file_num_compaction_trigger` can hold, so this bug is triggered.

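The interaction between the two quoted checks can be modelled in a few lines. This is a minimal sketch, not RocksDB code: the struct `CfState` and all helper names are hypothetical. The `+ 1` applied to the immutable MemTable count is an assumption chosen to match the numbers in the description (2 immutable MemTables stall with `max_write_buffer_number = 4`). The value 4 for `level0_file_num_compaction_trigger` and the sorted-run count are made up for illustration, since the issue does not state them.

```cpp
// Simplified model of the wait decision described in the issue.
// NOT the RocksDB source; every name below is illustrative.
struct CfState {
  int max_write_buffer_number;
  int min_write_buffer_number_to_merge;
  int level0_file_num_compaction_trigger;
  int num_immutable_memtables;  // stands in for cfd->imm()->NumNotFlushed()
  int l0_delay_trigger_count;   // number of sorted runs under Universal
};

// Mirrors the quoted early `break` in WaitUntilFlushWouldNotStallWrites.
constexpr bool BreaksOutEarly(const CfState& s) {
  return s.num_immutable_memtables < s.min_write_buffer_number_to_merge &&
         s.l0_delay_trigger_count < s.level0_file_num_compaction_trigger;
}

// Mirrors the quoted memtable-limit check. Assumption: the caller asks
// "what if one more immutable MemTable existed?", hence the + 1.
constexpr bool MemtableLimitDelays(const CfState& s) {
  return s.max_write_buffer_number > 3 &&
         s.num_immutable_memtables + 1 >= s.max_write_buffer_number - 1;
}

// Per the description, immutable MemTables are flushed only once there
// are at least min_write_buffer_number_to_merge of them.
constexpr bool FlushWillRun(const CfState& s) {
  return s.num_immutable_memtables >= s.min_write_buffer_number_to_merge;
}

// Behaviour described in sections 2.1 and 3: keep waiting for a flush.
constexpr bool ShouldWait(const CfState& s) {
  return !BreaksOutEarly(s) && MemtableLimitDelays(s);
}

// Stuck: the checkpoint waits for a flush that will never be scheduled.
constexpr bool Stuck(const CfState& s) {
  return ShouldWait(s) && !FlushWillRun(s);
}

// Restriction proposed in section 2.2: wait only if a flush can happen.
constexpr bool ShouldWaitWithFix(const CfState& s) {
  return ShouldWait(s) && FlushWillRun(s);
}

// Configuration from section 2.1; the last three values are assumed.
constexpr CfState kReported = {4, 3, 4, 2, 4};

static_assert(Stuck(kReported), "reported configuration waits forever");
static_assert(!ShouldWaitWithFix(kReported), "proposed restriction avoids it");
```

With fewer sorted runs than `level0_file_num_compaction_trigger`, `BreaksOutEarly` is true and the model never waits, which matches the observation in section 3 that the bug needs Universal compaction with many sorted runs.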


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
