[ https://issues.apache.org/jira/browse/FLINK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791599#comment-17791599 ]

Piotr Nowojski commented on FLINK-27681:
----------------------------------------

{quote}

However, the job must eventually fail (when the corrupted block is read or 
compacted, or when the number of failed checkpoints reaches 
tolerable-failed-checkpoints). It will then roll back to an older checkpoint.

That older checkpoint must predate the point at which the corruption was 
discovered. Therefore, running the job between the time the corruption is 
discovered and the time the job actually fails is useless.

In brief, tolerable-failed-checkpoints can work, but the extra cost isn't 
necessary.

{quote}

^^^ This
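
For context, the threshold referenced above is the standard 
execution.checkpointing.tolerable-failed-checkpoints setting; a minimal sketch of 
configuring it (plain Flink API, nothing specific to this ticket):

{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Minimal sketch: equivalent to setting
//   execution.checkpointing.tolerable-failed-checkpoints: 3
// in flink-conf.yaml.
public class TolerableFailuresExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // checkpoint every 60 s
        // After more than 3 consecutive checkpoint failures the job fails and
        // restores from an older, complete checkpoint (the rollback discussed above).
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
    }
}
{code}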

{quote}

I did some testing on my local machine. It takes about 60 to 70 ms to check a 
64 MB SST file; checking a 10 GB RocksDB instance takes about 10 seconds. More 
detailed testing may be needed later.

{quote}

That's a non-trivial overhead. Prolonging a checkpoint by 10 s will in many cases 
(especially low-throughput, large-state jobs) be prohibitively expensive, 
delaying rescaling, end-to-end exactly-once latency, etc. 1 s+ per 1 GB might 
also be less than ideal to enable by default.
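
For anyone who wants to reproduce such a measurement, a rough timing sketch, 
assuming the RocksJava verifyChecksum() binding of DB::VerifyChecksum() is the 
check being benchmarked (the path argument is a placeholder):

{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Rough timing probe: open an existing RocksDB instance and time a full
// checksum verification of its live SST data. Results will vary with disk
// speed, block cache state and SST file sizes.
public class ChecksumOverheadProbe {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        String dbPath = args[0]; // path to an existing RocksDB directory (placeholder input)
        try (Options options = new Options();
             RocksDB db = RocksDB.open(options, dbPath)) {
            long start = System.nanoTime();
            db.verifyChecksum(); // reads every block and validates its checksum
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("verifyChecksum() took %d ms%n", elapsedMs);
        }
    }
}
{code}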

{quote}

I'm not familiar with how to enable CRC for the filesystem/disk. Would you mind 
describing it in detail?

{quote}

Actually, don't basically all disks have some form of CRC these days? I'm 
certain that's true of SSDs. Having said that, can you [~masteryhx] rephrase and 
elaborate on those 3 scenarios that you think we need to protect against? In 
particular, where does the corruption happen?
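
(For illustration only: an explicit, application-level file CRC on the Flink 
side, as opposed to the transparent checksums done by disks/SSDs, could look 
roughly like the hypothetical sketch below. This is not an existing Flink 
mechanism, just an example of what a file-level CRC could mean, e.g. computed 
before and re-checked after an upload.)

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32C;

// Hypothetical illustration: compute a CRC32C over a local file so the same
// value can be re-computed later (e.g. after upload) and compared.
public final class FileCrc {
    public static long crc32c(Path file) throws IOException {
        CRC32C crc = new CRC32C();
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                crc.update(buffer, 0, read);
            }
        }
        return crc.getValue();
    }
}
{code}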

> Improve the availability of Flink when the RocksDB file is corrupted.
> ---------------------------------------------------------------------
>
>                 Key: FLINK-27681
>                 URL: https://issues.apache.org/jira/browse/FLINK-27681
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Ming Li
>            Assignee: Yue Ma
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: image-2023-08-23-15-06-16-717.png
>
>
> We have encountered several cases where the RocksDB checksum does not match or 
> block verification fails when the job is restored. The cause is usually a 
> problem with the machine where the task runs, which results in incorrect files 
> being uploaded to HDFS, but by the time we noticed the problem a long time (a 
> dozen minutes to half an hour) had already passed. I'm not sure if anyone else 
> has had a similar problem.
> Since such a file is referenced by incremental checkpoints for a long time, 
> once the maximum number of retained checkpoints is exceeded we have no choice 
> but to keep using it until it is no longer referenced. When the job then fails, 
> it cannot be recovered.
> Therefore we are considering:
> 1. Can RocksDB periodically check whether all files are correct and detect the 
> problem in time?
> 2. Can Flink automatically roll back to a previous checkpoint when there is a 
> problem with the checkpoint data? Even with manual intervention, all we can do 
> is try to recover from an existing checkpoint or discard the entire state.
> 3. Can we limit the number of references to a file based on the maximum number 
> of retained checkpoints? When the number of references exceeds the maximum 
> number of retained checkpoints - 1, the Task side would be required to upload a 
> new file for that reference. We are not sure whether this approach guarantees 
> that the newly uploaded file is correct.
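
Regarding proposal (3) in the description, the bookkeeping presumably amounts to 
something like the following hypothetical sketch (all class and method names are 
invented for illustration; this is not existing Flink code):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of proposal (3): track how many retained checkpoints
// reference a shared state file and force a re-upload once the count reaches
// (max retained checkpoints - 1), so no single physical file outlives that window.
public class SharedFileReferenceTracker {
    private final int maxRetainedCheckpoints;
    private final Map<String, Integer> referenceCounts = new ConcurrentHashMap<>();

    public SharedFileReferenceTracker(int maxRetainedCheckpoints) {
        this.maxRetainedCheckpoints = maxRetainedCheckpoints;
    }

    /** Returns true if the task should upload a fresh copy instead of re-referencing the file. */
    public boolean registerReference(String stateFileId) {
        int refs = referenceCounts.merge(stateFileId, 1, Integer::sum);
        return refs > maxRetainedCheckpoints - 1;
    }
}
{code}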



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
