[ 
https://issues.apache.org/jira/browse/FLINK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791409#comment-17791409
 ] 

Hangxiang Yu commented on FLINK-27681:
--------------------------------------

{quote}Fail job directly is fine for me, but I guess the PR doesn't fail the 
job, it just fails the current checkpoint, right?
{quote}
Yeah, I think failing the checkpoint may also be fine for now. It will not 
affect the correctness of the running job.
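For reference, whether a failed checkpoint also fails the job is already 
governed by the tolerable-failed-checkpoints setting, so failing only the 
checkpoint still lets users opt into the stricter behavior. A minimal sketch, 
assuming the standard DataStream CheckpointConfig API (not the change in the PR):
{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointFailurePolicy {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60s.
        env.enableCheckpointing(60_000L);

        // If a checkpoint fails (e.g. a checksum mismatch is detected), only
        // fail the job after this many consecutive checkpoint failures;
        // 0 means any checkpoint failure fails the job.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
    }
}
{code}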

The downside is that the job has to roll back to an older checkpoint. But there 
should be some policies for high-quality jobs, just as [~mayuehappy] mentioned.
{quote}If the checksum is called for each reading, can we think the check is 
very quick? If so, could we enable it directly without any option? Hey 
[~mayuehappy]  , could you provide some simple benchmark here?
{quote}
The check at runtime is at block level, so its overhead should be small 
(RocksDB always needs to read the block from disk at runtime anyway, so the 
checksum can be computed cheaply).
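For reference, this block-level check is the standard RocksDB behavior when 
checksum verification on reads is enabled. A minimal sketch with the RocksJava 
API (option and enum names assumed from the org.rocksdb bindings):
{code:java}
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.ChecksumType;
import org.rocksdb.Options;
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;

public class BlockChecksumExample {
    public static void main(String[] args) {
        RocksDB.loadLibrary();

        // Every data block written to an SST file carries a checksum of the
        // configured type; it is recomputed whenever the block is read from
        // disk, so the check piggybacks on I/O that happens anyway.
        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
                .setChecksumType(ChecksumType.kCRC32c);

        Options options = new Options()
                .setCreateIfMissing(true)
                .setTableFormatConfig(tableConfig);

        // Verify per-block checksums on read (this is the default).
        ReadOptions readOptions = new ReadOptions().setVerifyChecksums(true);
    }
}
{code}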

But a file-level checksum always adds extra overhead, and the overhead grows 
with the size of the state, which is why I'd suggest making it an option. I 
also appreciate and look forward to the benchmark results from [~mayuehappy] 
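As a point of comparison, the file-level check would be closer to RocksDB's 
full checksum scan, which reads every live SST file. A minimal sketch with 
RocksJava, assuming the bindings expose DB::VerifyChecksum() as 
RocksDB#verifyChecksum and using a hypothetical local path:
{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class FileChecksumScan {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-checksum-demo")) {

            // Reads every block of every live SST file and verifies its
            // checksum. The cost grows with the total state size, which is
            // why it should stay behind an option rather than always on.
            db.verifyChecksum();
        }
    }
}
{code}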

> Improve the availability of Flink when the RocksDB file is corrupted.
> ---------------------------------------------------------------------
>
>                 Key: FLINK-27681
>                 URL: https://issues.apache.org/jira/browse/FLINK-27681
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Ming Li
>            Assignee: Yue Ma
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: image-2023-08-23-15-06-16-717.png
>
>
> We have encountered several cases where the RocksDB checksum does not match 
> or block verification fails when the job is restored. The cause is generally 
> a problem with the machine the task runs on, which results in incorrect files 
> being uploaded to HDFS, but by the time we discover the problem a long time 
> has already passed (a dozen minutes to half an hour). I'm not sure if anyone 
> else has had a similar problem.
> Since such a file is referenced by incremental checkpoints for a long time, 
> once the maximum number of retained checkpoints is exceeded we can only keep 
> using the corrupted file until it is no longer referenced. When the job 
> fails, it cannot be recovered.
> Therefore we consider:
> 1. Can RocksDB periodically check whether all files are correct, so that the 
> problem is found in time?
> 2. Can Flink automatically roll back to an earlier checkpoint when there is a 
> problem with the checkpoint data? Even with manual intervention, all one can 
> do is try to recover from an existing checkpoint or discard the entire state.
> 3. Can we limit the maximum number of references to a file based on the 
> maximum number of retained checkpoints? When the number of references exceeds 
> (maximum number of retained checkpoints - 1), the Task side would be required 
> to upload a new file for this reference. I'm not sure whether this guarantees 
> that the newly uploaded file is correct.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
