[ https://issues.apache.org/jira/browse/FLINK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791599#comment-17791599 ]
Piotr Nowojski commented on FLINK-27681:
----------------------------------------

{quote}
However, the job must fail in the future (when the corrupted block is read or compacted, or checkpoint failed number >= tolerable-failed-checkpoint). Then it will rollback to the older checkpoint. The older checkpoint must be before we found the file is corrupted. Therefore, it is useless to run a job between the time it is discovered that the file is corrupted and the time it actually fails. In brief, tolerable-failed-checkpoint can work, but the extra cost isn't necessary.
{quote}
^^^ This

{quote}
I did some testing on my local machine. It takes about 60 to 70ms to check a 64M sst file. Checking a 10GB rocksdb instance takes about 10 seconds. More detailed testing may be needed later.
{quote}
That's a non-trivial overhead. Prolonging checkpoints by 10s will in many cases (especially low-throughput, large-state jobs) be prohibitively expensive, delaying rescaling, e2e exactly-once latency, etc. 1s+ per 1GB might also be less than ideal to enable by default.

{quote}
I'm not familiar with how to enable CRC for filesystem/disk? Would you mind describing it in detail?
{quote}
Actually, don't basically all disks have some form of CRC these days? I'm certain that's true for SSDs.

Having said that, [~masteryhx], could you rephrase and elaborate on the 3 scenarios that you think we need to protect against? In particular, where does the corruption happen?

> Improve the availability of Flink when the RocksDB file is corrupted.
> ---------------------------------------------------------------------
>
>                 Key: FLINK-27681
>                 URL: https://issues.apache.org/jira/browse/FLINK-27681
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Ming Li
>            Assignee: Yue Ma
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: image-2023-08-23-15-06-16-717.png
>
>
> We have encountered several cases where the RocksDB checksum does not match or
> the block verification fails when a job is restored. The cause is generally a
> problem with the machine where the task runs, which makes the files uploaded to
> HDFS incorrect; by the time we noticed the problem, a long time (a dozen
> minutes to half an hour) had already passed. I'm not sure if anyone else has
> had a similar problem.
> Since such a file is referenced by incremental checkpoints for a long time,
> even after the maximum number of retained checkpoints is exceeded, the file
> keeps being used until it is no longer referenced. When the job then fails, it
> cannot be recovered.
> Therefore we consider:
> 1. Can RocksDB periodically check whether all files are correct and detect the
> problem in time?
> 2. Can Flink automatically roll back to an earlier checkpoint when there is a
> problem with the checkpoint data? Even with manual intervention, all we can do
> is try to recover from an existing checkpoint or discard the entire state.
> 3. Can we limit the number of references to a file based on the maximum number
> of retained checkpoints? When the number of references exceeds the maximum
> number of retained checkpoints - 1, the task side would be required to upload a
> new copy of the file. We are not sure whether this guarantees that the newly
> uploaded file is correct.
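To make the verification-cost discussion above concrete, here is a minimal sketch of what the per-file check behind suggestion 1 could look like with the RocksJava bindings. It assumes {{SstFileReader#verifyChecksum}} is available in the bundled RocksDB version; the class and method names ({{SstChecksumProbe}}, {{verifyAll}}) are purely illustrative and not an existing Flink or RocksDB API.

{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDBException;
import org.rocksdb.SstFileReader;

import java.io.File;

/** Illustrative sketch: verify block checksums of the SST files in a local RocksDB directory. */
public class SstChecksumProbe {

    public static void verifyAll(File dbDirectory) throws RocksDBException {
        File[] sstFiles = dbDirectory.listFiles((dir, name) -> name.endsWith(".sst"));
        if (sstFiles == null) {
            return; // not a directory
        }
        try (Options options = new Options()) {
            for (File sst : sstFiles) {
                long startNanos = System.nanoTime();
                try (SstFileReader reader = new SstFileReader(options)) {
                    reader.open(sst.getAbsolutePath());
                    // Re-reads the whole file and validates the per-block checksums;
                    // throws RocksDBException if a corrupted block is found.
                    reader.verifyChecksum();
                }
                long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
                System.out.printf("%s: OK in %d ms%n", sst.getName(), elapsedMs);
            }
        }
    }
}
{code}

Timing each file this way would make it easy to reproduce the ~60-70 ms per 64 MB figure quoted above on one's own hardware, and it illustrates why running such a check on every checkpoint adds seconds for a multi-GB instance.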