[ https://issues.apache.org/jira/browse/FLINK-19303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabor Somogyi resolved FLINK-19303.
-----------------------------------
    Resolution: Fixed

> Disable WAL in RocksDB recovery
> -------------------------------
>
>                 Key: FLINK-19303
>                 URL: https://issues.apache.org/jira/browse/FLINK-19303
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Juha Mynttinen
>            Assignee: Juha Mynttinen
>            Priority: Minor
>
> During recovery of {{RocksDBStateBackend}}, the recovery mechanism writes the
> key-value pairs into the local RocksDB instance(s). To speed up the process,
> recovery uses the RocksDB write batch mechanism. [RocksDB
> WAL|https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log] is enabled
> during this process.
> During normal operations, i.e. once the state backend has been recovered and
> the Flink application is running (on the RocksDB state backend), WAL is
> disabled.
> The recovery process doesn't need WAL. In fact, recovery should be much
> faster without it. Thus, WAL should be disabled in the recovery process.
> AFAIK the last thing that was done with WAL during recovery was an attempt to
> disable it. That change was later reverted because it caused stability issues
> (https://issues.apache.org/jira/browse/FLINK-8922).
> Unfortunately, the root cause of the segfault that occurs when WAL is disabled
> during recovery is unknown. After all, WAL is not used during normal
> operations. A potential explanation is some kind of bug in the RocksDB write
> batch when using WAL. It is possible that later RocksDB versions have fixes or
> workarounds for the issue.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
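
To illustrate why skipping the WAL saves work when restoring a batch of key-value pairs, here is a minimal toy model. It is not RocksDB and not Flink code: `ToyStore`, `writeBatch`, `WalToggleSketch` and `restoreAndMeasureWal` are hypothetical names made up for this sketch, and the "log" is just a temp file. In RocksDB's Java API the corresponding switch is, to my knowledge, `WriteOptions#setDisableWAL(boolean)`, passed along with a `WriteBatch` to `RocksDB#write`; the issue does not show how the actual fix was implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy key-value store (hypothetical, NOT RocksDB) with an optional write-ahead log. */
final class ToyStore {
    private final Map<String, String> table = new TreeMap<>();
    private final Path walFile;

    ToyStore(Path walFile) {
        this.walFile = walFile;
    }

    /** Applies a batch; appends it to the log file first unless disableWal is true. */
    void writeBatch(Map<String, String> batch, boolean disableWal) {
        if (!disableWal) {
            StringBuilder record = new StringBuilder();
            for (Map.Entry<String, String> e : batch.entrySet()) {
                record.append(e.getKey()).append('=').append(e.getValue()).append('\n');
            }
            try {
                // Extra I/O that exists only so a crash could be replayed later.
                Files.write(walFile, record.toString().getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        }
        table.putAll(batch);
    }

    int size() {
        return table.size();
    }
}

final class WalToggleSketch {
    /** Restores a snapshot into a fresh ToyStore and returns the bytes written to its log. */
    static long restoreAndMeasureWal(Map<String, String> snapshot, boolean disableWal) {
        try {
            Path wal = Files.createTempFile("toy-wal", ".log");
            try {
                ToyStore store = new ToyStore(wal);
                store.writeBatch(snapshot, disableWal);
                if (store.size() != snapshot.size()) {
                    throw new IllegalStateException("restore lost entries");
                }
                return Files.size(wal);
            } finally {
                Files.deleteIfExists(wal);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        Map<String, String> snapshot = new LinkedHashMap<>();
        snapshot.put("k1", "v1");
        snapshot.put("k2", "v2");
        System.out.println("WAL bytes, WAL enabled:  " + restoreAndMeasureWal(snapshot, false));
        System.out.println("WAL bytes, WAL disabled: " + restoreAndMeasureWal(snapshot, true));
    }
}
```

In both cases the store ends up with the same contents; the only difference is the additional log write. During recovery the source of truth is the checkpoint being restored, so a failed restore can simply be restarted from it, which is why the log is redundant there.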