[jira] [Commented] (FLINK-27681) Improve the availability of Flink when the RocksDB file is corrupted.

Yue Ma (Jira) Mon, 27 Nov 2023 20:30:05 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790359#comment-17790359
 ]


Yue Ma commented on FLINK-27681:
--------------------------------

 
{quote}Fail job directly is fine for me, but I guess the PR doesn't fail the 
job, it just fails the current checkpoint, right?
{quote}
I think it can be used together with 
{*}execution.checkpointing.tolerable-failed-checkpoints{*}. More generally, 
users running an important job will also monitor whether its checkpoints are 
completing successfully.
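To illustrate how a failed checkpoint could escalate to a job failure, here is 
a minimal sketch of a consecutive-failure budget. It is a simplified model, not 
Flink's implementation: the class name CheckpointFailureBudget and its methods 
are hypothetical, and it assumes the option counts consecutive failures and 
that a successful checkpoint resets the count.

```java
// Hypothetical, simplified model of a tolerable-failed-checkpoints budget.
// Assumption: failures are counted consecutively and a success resets the count.
final class CheckpointFailureBudget {
    private final int tolerableFailures;
    private int consecutiveFailures = 0;

    CheckpointFailureBudget(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /** Records a completed checkpoint and resets the consecutive-failure count. */
    void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    /**
     * Records a failed checkpoint (for example, one rejected because an SST
     * file did not pass verification).
     *
     * @return true if the budget is exhausted and the job should fail
     */
    boolean onCheckpointFailure() {
        consecutiveFailures++;
        return consecutiveFailures > tolerableFailures;
    }
}
```

Under this model, a budget of 0 turns the first failed checkpoint into a job 
failure, while a larger budget lets the job keep running as long as a later 
checkpoint succeeds.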
{quote}could you provide some simple benchmark here?
{quote}
I did some testing on my local machine. Checking a 64 MB SST file takes about 
60 to 70 ms, and checking a 10 GB RocksDB instance takes about 10 seconds. More 
detailed testing may be needed later.
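
As a sanity check on these figures, the following sketch estimates the total 
check time from the per-file time. It assumes files are checked one after 
another, that every file is 64 MiB, and that sizes use binary units; the class 
SstCheckEstimate and its methods are hypothetical helpers for the arithmetic 
only.

```java
// Hypothetical helper: back-of-the-envelope estimate, not a benchmark.
// Assumptions: sequential checking, uniform file size, binary units (MiB/GiB).
final class SstCheckEstimate {
    static final long MIB = 1024L * 1024L;
    static final long GIB = 1024L * MIB;

    /** Number of files needed to hold totalBytes, rounded up. */
    static long fileCount(long totalBytes, long fileBytes) {
        return (totalBytes + fileBytes - 1) / fileBytes;
    }

    /** Estimated seconds to check every file, one after another. */
    static double estimateSeconds(long totalBytes, long fileBytes, double millisPerFile) {
        return fileCount(totalBytes, fileBytes) * millisPerFile / 1000.0;
    }

    public static void main(String[] args) {
        long files = fileCount(10 * GIB, 64 * MIB);
        double low = estimateSeconds(10 * GIB, 64 * MIB, 60.0);
        double high = estimateSeconds(10 * GIB, 64 * MIB, 70.0);
        System.out.printf("%d files, %.1f to %.1f seconds%n", files, low, high);
    }
}
```

With 160 files at 60 to 70 ms each, the estimate is roughly 9.6 to 11.2 
seconds, which is in line with the measured 10 seconds.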


 

> Improve the availability of Flink when the RocksDB file is corrupted.
> ---------------------------------------------------------------------
>
>                 Key: FLINK-27681
>                 URL: https://issues.apache.org/jira/browse/FLINK-27681
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Ming Li
>            Assignee: Yue Ma
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: image-2023-08-23-15-06-16-717.png
>
>
> Several times we have seen RocksDB checksum mismatches or block 
> verification failures when a job is restored. The usual cause is a problem 
> with the machine where the task runs, which causes the files uploaded to 
> HDFS to be incorrect. By the time we discover the problem, however, a long 
> time (ten-odd minutes to half an hour) has already passed. I'm not sure if 
> anyone else has had a similar problem.
> Because incremental checkpoints keep referencing this file for a long time, 
> once the maximum number of retained checkpoints is exceeded, every 
> checkpoint we can still use depends on this file until it is no longer 
> referenced. If the job fails during that time, it cannot be recovered.
> Therefore we consider:
> 1. Can RocksDB periodically check whether all files are correct and find the 
> problem in time?
> 2. Can Flink automatically roll back to the previous checkpoint when there is 
> a problem with the checkpoint data? Even with manual intervention, the only 
> options are to recover from an existing checkpoint or to discard the entire 
> state.
> 3. Can we increase the maximum number of references to a file based on the 
> maximum number of checkpoints reserved? When the number of references exceeds 
> the maximum number of checkpoints minus 1, the Task side is required to 
> upload a new file for this reference. I am not sure whether this would 
> ensure that the newly uploaded file is correct.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
