[ https://issues.apache.org/jira/browse/FLINK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789994#comment-17789994 ]
Yue Ma commented on FLINK-27681: -------------------------------- [~masteryhx] Thanks for the explaining , I agree with it > Improve the availability of Flink when the RocksDB file is corrupted. > --------------------------------------------------------------------- > > Key: FLINK-27681 > URL: https://issues.apache.org/jira/browse/FLINK-27681 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends > Reporter: Ming Li > Assignee: Yue Ma > Priority: Critical > Labels: pull-request-available > Attachments: image-2023-08-23-15-06-16-717.png > > > We have encountered several times when the RocksDB checksum does not match or > the block verification fails when the job is restored. The reason for this > situation is generally that there are some problems with the machine where > the task is located, which causes the files uploaded to HDFS to be incorrect, > but it has been a long time (a dozen minutes to half an hour) when we found > this problem. I'm not sure if anyone else has had a similar problem. > Since this file is referenced by incremental checkpoints for a long time, > when the maximum number of checkpoints reserved is exceeded, we can only use > this file until it is no longer referenced. When the job failed, it cannot be > recovered. > Therefore we consider: > 1. Can RocksDB periodically check whether all files are correct and find the > problem in time? > 2. Can Flink automatically roll back to the previous checkpoint when there is > a problem with the checkpoint data, because even with manual intervention, it > just tries to recover from the existing checkpoint or discard the entire > state. > 3. Can we increase the maximum number of references to a file based on the > maximum number of checkpoints reserved? When the number of references exceeds > the maximum number of checkpoints -1, the Task side is required to upload a > new file for this reference. Not sure if this way will ensure that the new > file we upload will be correct. -- This message was sent by Atlassian Jira (v8.20.10#820010)