[jira] [Commented] (FLINK-27681) Improve the availability of Flink when the RocksDB file is corrupted.

Rui Fan (Jira) Wed, 29 Nov 2023 22:56:05 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791485#comment-17791485
 ]


Rui Fan commented on FLINK-27681:
---------------------------------

Hey [~mayuehappy] and [~masteryhx], thanks for your feedback. :)
{quote}The downside is that the job has to rollback to the older checkpoint. 
But there should be some policies for high-quality job just as [~mayuehappy] 
mentioned.
{quote}
My concern is this: if we find that a file is corrupted and only fail the 
checkpoint, the job will continue to run (if tolerable-failed-checkpoints > 0), 
but no later checkpoint can ever be completed.

However, the job is bound to fail eventually, either when the corrupted block 
is read or compacted, or when the number of failed checkpoints exceeds 
tolerable-failed-checkpoints. It will then roll back to an older checkpoint.

That older checkpoint necessarily predates the moment we found the corrupted 
file. Therefore, it is useless to keep the job running between the time the 
corruption is discovered and the time the job actually fails, because all of 
that work is replayed after the rollback.

In brief, tolerable-failed-checkpoints can work, but the extra cost isn't 
necessary.
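
To put a rough number on that extra cost, here is a small back-of-the-envelope 
sketch in Python. It is not Flink code: the function name and its parameters are 
hypothetical, and it assumes that the corruption is detected at a checkpoint, 
that every later checkpoint fails for the same reason, and that the job fails 
once the number of consecutive failures exceeds the tolerable count. It ignores 
checkpoint duration and the case where the corrupted block is read or compacted 
earlier.

```python
def wasted_runtime_seconds(checkpoint_interval_s: float,
                           tolerable_failed_checkpoints: int) -> float:
    """Estimate how long the job keeps running after corruption is detected.

    Hypothetical model, not Flink's implementation: the first failed
    checkpoint happens at detection time, and each further failure arrives
    one checkpoint interval later.
    """
    if checkpoint_interval_s <= 0:
        raise ValueError("checkpoint_interval_s must be positive")
    if tolerable_failed_checkpoints < 0:
        raise ValueError("tolerable_failed_checkpoints must be >= 0")

    # Assumption: the job fails on failure number (tolerable + 1).
    failures_until_job_fails = tolerable_failed_checkpoints + 1

    # The first failure coincides with detection, so only the remaining
    # failures add waiting time.
    return (failures_until_job_fails - 1) * checkpoint_interval_s


if __name__ == "__main__":
    # Example values only: 60 s interval, 3 tolerable failures.
    print(wasted_runtime_seconds(60, 3))
```

Under these assumptions the job runs for roughly (tolerable failures x 
checkpoint interval) after the corruption is already known, and none of that 
progress survives the rollback. With a tolerable count of 0 the estimate is 
zero, which matches failing the job directly.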

BTW, if we fail the job directly, this 
[comment|https://github.com/apache/flink/pull/23765#discussion_r1404136470] 
will be resolved as well.
{quote}The check at runtime is block level, whose overhead should be little 
(rocksdb always need to read the block from the disk at runtime, so the 
checksum could be calculated easily).
{quote}
Thanks [~masteryhx] for the clarification.
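
For readers following along, the idea of a block-level check can be sketched 
as below. This is only an illustration in Python, not RocksDB's actual on-disk 
format or API: the helper names and the layout (a 4-byte CRC32 appended to each 
block) are simplifying assumptions. The point is that the checksum is computed 
over bytes that had to be read anyway.

```python
import zlib

CHECKSUM_BYTES = 4  # assumed trailer size for this sketch


def seal_block(payload: bytes) -> bytes:
    """Append a CRC32 trailer to a block (hypothetical layout)."""
    checksum = zlib.crc32(payload)
    return payload + checksum.to_bytes(CHECKSUM_BYTES, "little")


def read_block(stored: bytes) -> bytes:
    """Return the payload if its trailer matches, otherwise raise IOError."""
    if len(stored) < CHECKSUM_BYTES:
        raise ValueError("stored block is shorter than its checksum trailer")
    payload = stored[:-CHECKSUM_BYTES]
    expected = int.from_bytes(stored[-CHECKSUM_BYTES:], "little")
    # The block is already in memory at this point, so verifying it costs
    # one pass over the payload and no additional disk read.
    if zlib.crc32(payload) != expected:
        raise IOError("block checksum mismatch")
    return payload


if __name__ == "__main__":
    block = seal_block(b"some state bytes")
    print(read_block(block))
```

A single flipped bit in the stored block makes `read_block` raise, which 
corresponds to the runtime failure on reading a corrupted block mentioned 
above.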

 
{quote}Wouldn't the much more reliable and faster solution be to enable CRC on 
the local filesystem/disk that Flink's using? Benefits of this approach:
 * no changes to Flink/no increased complexity of our code base
 * would protect from not only errors that happen to occur between writing the 
file and uploading to the DFS, but also from any errors that happen at any 
point of time
 * would amortise the performance hit. Instead of amplifying reads by 100%, 
error correction bits/bytes are a small fraction of the payload, so the 
performance penalty would be at every read/write access but ultimately a very 
small fraction of the total cost of reading{quote}
Doesn't [~pnowojski]'s suggestion also cause the job to fail directly? Also, 
I'm not familiar with how to enable CRC for a filesystem/disk. Would you mind 
describing it in detail?

> Improve the availability of Flink when the RocksDB file is corrupted.
> ---------------------------------------------------------------------
>
>                 Key: FLINK-27681
>                 URL: https://issues.apache.org/jira/browse/FLINK-27681
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Ming Li
>            Assignee: Yue Ma
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: image-2023-08-23-15-06-16-717.png
>
>
> We have encountered several times when the RocksDB checksum does not match or 
> the block verification fails when the job is restored. The reason for this 
> situation is generally that there are some problems with the machine where 
> the task is located, which causes the files uploaded to HDFS to be incorrect, 
> but it has been a long time (a dozen minutes to half an hour) when we found 
> this problem. I'm not sure if anyone else has had a similar problem.
> Since this file is referenced by incremental checkpoints for a long time, 
> when the maximum number of checkpoints reserved is exceeded, we can only use 
> this file until it is no longer referenced. When the job failed, it cannot be 
> recovered.
> Therefore we consider:
> 1. Can RocksDB periodically check whether all files are correct and find the 
> problem in time?
> 2. Can Flink automatically roll back to the previous checkpoint when there is 
> a problem with the checkpoint data, because even with manual intervention, it 
> just tries to recover from the existing checkpoint or discard the entire 
> state.
> 3. Can we increase the maximum number of references to a file based on the 
> maximum number of checkpoints reserved? When the number of references exceeds 
> the maximum number of checkpoints -1, the Task side is required to upload a 
> new file for this reference. Not sure if this way will ensure that the new 
> file we upload will be correct.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
