[ https://issues.apache.org/jira/browse/FLINK-21726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305300#comment-17305300 ]
Yun Tang commented on FLINK-21726:
----------------------------------

[~trohrmann] Sorry for the late reply; Apache Jira failed to send me any notifications again.

RocksDB uses LEVEL compaction by default, and streaming computation cares more about read latency, whereas UNIVERSAL compaction reduces write amplification at the cost of much worse read amplification. Based on our knowledge and development experience with an internal LSM-like DB, I think our users are unlikely to run into this problem.

[~fanrui] Thanks for your enthusiasm in fixing this problem and contributing back to the RocksDB community. We already plan to bump the FrocksDB version to the latest one in the next Flink release, together with the byte buffer improvement, to close the gap left by the current performance regression, and we could include your fix then.

> Fix checkpoint stuck
> --------------------
>
>                 Key: FLINK-21726
>                 URL: https://issues.apache.org/jira/browse/FLINK-21726
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>    Affects Versions: 1.11.3, 1.12.2, 1.13.0
>            Reporter: fanrui
>            Priority: Critical
>             Fix For: 1.13.0
>
>
> h1. 1. Bug description:
> When RocksDB takes a checkpoint, it may get stuck in the
> `WaitUntilFlushWouldNotStallWrites` method.
> h1. 2. Simple analysis of the reasons:
> h2. 2.1 Configuration parameters:
>
> {code:java}
> # Flink yaml:
> state.backend.rocksdb.predefined-options: SPINNING_DISK_OPTIMIZED_HIGH_MEM
> state.backend.rocksdb.compaction.style: UNIVERSAL
> # corresponding RocksDB config
> Compaction Style : Universal
> max_write_buffer_number : 4
> min_write_buffer_number_to_merge : 3
> {code}
> A checkpoint is usually very fast. When the checkpoint is executed,
> `WaitUntilFlushWouldNotStallWrites` is called. If there are 2 immutable
> MemTables, which is fewer than `min_write_buffer_number_to_merge`, they will
> not be flushed, but execution still reaches the following code:
>
> {code:java}
> // method: GetWriteStallConditionAndCause
> if (mutable_cf_options.max_write_buffer_number > 3 &&
>     num_unflushed_memtables >=
>         mutable_cf_options.max_write_buffer_number - 1) {
>   return {WriteStallCondition::kDelayed, WriteStallCause::kMemtableLimit};
> }
> {code}
> code link:
> [https://github.com/facebook/rocksdb/blob/fbed72f03c3d9e4fdca3e5993587ef2559ba6ab9/db/column_family.cc#L847]
> The checkpoint assumes that a FlushJob is pending, but there is none, so it
> waits forever.
> h2. 2.2 solution:
> Tighten the condition: wait only when the number of immutable MemTables >=
> `min_write_buffer_number_to_merge`.
> The RocksDB community has fixed this bug, link:
> [https://github.com/facebook/rocksdb/pull/7921]
> h2. 2.3 Code that can reproduce the bug:
> [https://github.com/1996fanrui/fanrui-learning/blob/flink-1.12/module-java/src/main/java/com/dream/rocksdb/RocksDBCheckpointStuck.java]
> h1. 3. Interesting point
> This bug is triggered only when the number of sorted runs >=
> `level0_file_num_compaction_trigger`, because otherwise
> `WaitUntilFlushWouldNotStallWrites` breaks out of its wait loop:
> {code:java}
> if (cfd->imm()->NumNotFlushed() <
>         cfd->ioptions()->min_write_buffer_number_to_merge &&
>     vstorage->l0_delay_trigger_count() <
>         mutable_cf_options.level0_file_num_compaction_trigger) {
>   break;
> }
> {code}
> code link:
> [https://github.com/facebook/rocksdb/blob/fbed72f03c3d9e4fdca3e5993587ef2559ba6ab9/db/db_impl/db_impl_compaction_flush.cc#L1974]
> With Universal compaction, `l0_delay_trigger_count()` can be >=
> `level0_file_num_compaction_trigger`, so this bug is triggered.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
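
For readers who want to check the two quoted conditions against the reported configuration, here is a minimal C++ sketch. It is not RocksDB code. The struct and function names are hypothetical, the value 4 for `level0_file_num_compaction_trigger` is an assumed example (the issue does not state it), and the `imm + 1` argument reflects an assumption that the stall check is evaluated as if one more immutable MemTable were added. The "fixed" variant only models the restriction described in section 2.2 and is not the actual patch.

```cpp
// Hypothetical, simplified stand-in for the RocksDB options named in the issue.
struct Options {
  int max_write_buffer_number;
  int min_write_buffer_number_to_merge;
  int level0_file_num_compaction_trigger;
};

// Memtable part of the quoted GetWriteStallConditionAndCause check.
constexpr bool MemtableStall(const Options& o, int num_unflushed_memtables) {
  return o.max_write_buffer_number > 3 &&
         num_unflushed_memtables >= o.max_write_buffer_number - 1;
}

// The quoted early exit ("break") in WaitUntilFlushWouldNotStallWrites.
constexpr bool BreaksOutOfWait(const Options& o, int imm, int l0_count) {
  return imm < o.min_write_buffer_number_to_merge &&
         l0_count < o.level0_file_num_compaction_trigger;
}

// ASSUMPTION: the stall check sees imm + 1 memtables (one extra memtable).
constexpr bool WaitsForFlush(const Options& o, int imm, int l0_count) {
  return !BreaksOutOfWait(o, imm, l0_count) && MemtableStall(o, imm + 1);
}

// Stuck: waiting for a flush that is never scheduled, because there are fewer
// immutable memtables than min_write_buffer_number_to_merge.
constexpr bool Stuck(const Options& o, int imm, int l0_count) {
  return WaitsForFlush(o, imm, l0_count) &&
         imm < o.min_write_buffer_number_to_merge;
}

// Model of the restriction from section 2.2: wait only if a flush can happen.
constexpr bool WaitsForFlushFixed(const Options& o, int imm, int l0_count) {
  return WaitsForFlush(o, imm, l0_count) &&
         imm >= o.min_write_buffer_number_to_merge;
}

// Values from section 2.1; the last field (4) is an assumed example value.
constexpr Options kIssueConfig{4, 3, 4};

static_assert(Stuck(kIssueConfig, 2, 4), "2 immutable memtables, 4 sorted runs");
static_assert(!Stuck(kIssueConfig, 2, 3), "below the trigger the loop breaks");
static_assert(!WaitsForFlushFixed(kIssueConfig, 2, 4), "restricted wait");
```

Under these assumptions, the configuration from section 2.1 with 2 immutable MemTables waits forever only once the sorted-run count reaches the trigger, which matches the observation in section 3.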