[ https://issues.apache.org/jira/browse/FLINK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794190#comment-17794190 ]

Hangxiang Yu commented on FLINK-27681:
--------------------------------------

{quote}Yes, I think we don't need any extra protection for corruption of the 
local files. From the document you shared, RocksDB will throw an error every 
time we try to read a corrupted block.
{quote}
Yes, reads of a corrupted block are always checked, so the read path is safe.

But a write operation (e.g. flush or compaction) may produce a new corrupted 
file, and that file may never be checked.

And if we don't verify it manually, such a corrupted file may be uploaded to 
remote storage during a checkpoint without any verification, unlike reads, 
which always validate the block checksum.
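For illustration, a manual check before upload could look roughly like the sketch below. This is only a sketch: the class name is made up, and it assumes the RocksJava version bundled with Flink exposes SstFileReader#verifyChecksum; it is not what Flink currently does.
{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDBException;
import org.rocksdb.SstFileReader;

/**
 * Sketch: before an SST file produced by flush/compaction is uploaded during a
 * checkpoint, re-read it once so that RocksDB's per-block checksums are verified.
 */
public final class SstPreUploadCheck {

    public static void verifyBeforeUpload(String sstFilePath) throws RocksDBException {
        try (Options options = new Options();
             SstFileReader reader = new SstFileReader(options)) {
            reader.open(sstFilePath);
            // Scans the whole file and validates every block checksum;
            // throws RocksDBException if any block is corrupted.
            reader.verifyChecksum();
        }
    }
}
{code}
The cost is one extra full read of each new file, so it would probably only be worth doing for files that are about to be uploaded in a checkpoint, not for every flush/compaction output.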
{quote}Now I'm not so sure about it. Now that I think about it more, checksums 
on the filesystem level or the HDD/SSD level wouldn't protect us from a 
corruption happening after reading the bytes from local file, but before those 
bytes are acknowledged by the DFS/object store. 
{quote}
Yes, you're right. That's the end-to-end checksum I mentioned before (verifying 
file correctness from local to remote with a unified checksum). And thanks for 
sharing the detailed info about S3.

"But this may introduce a new API in some public classes like FileSystem which 
is a bigger topic." , maybe need a FLIP ?

We have also tried adding this end-to-end checksum in our internal Flink 
version, and it is doable for many file systems.

We could also contribute it back after we have verified the benefits and the 
performance cost, if it turns out to be worthwhile.
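To make the idea concrete, below is a minimal sketch of such an end-to-end check (the class and method names are made up; this is not the implementation from our internal version). It checksums the local bytes while streaming them to the remote FileSystem and then verifies what the remote store persisted. The read-back at the end is only there to keep the sketch self-contained; a real implementation would instead compare against a checksum reported by the remote store, which is exactly where the new FileSystem API would come in.
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.CRC32;

import org.apache.flink.core.fs.FSDataInputStream;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

public final class EndToEndChecksumUpload {

    public static void uploadWithVerification(String localFile, URI remoteUri) throws IOException {
        final CRC32 localCrc = new CRC32();
        final byte[] buffer = new byte[64 * 1024];

        final FileSystem remoteFs = FileSystem.get(remoteUri);
        final Path remotePath = new Path(remoteUri);

        // Stream the local file to the remote FileSystem, checksumming the bytes as we read them.
        try (InputStream in = Files.newInputStream(Paths.get(localFile));
                FSDataOutputStream out = remoteFs.create(remotePath, FileSystem.WriteMode.OVERWRITE)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                localCrc.update(buffer, 0, read);
                out.write(buffer, 0, read);
            }
        }

        // Verify that the bytes the remote store persisted match what we read locally.
        final CRC32 remoteCrc = new CRC32();
        try (FSDataInputStream in = remoteFs.open(remotePath)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                remoteCrc.update(buffer, 0, read);
            }
        }

        if (localCrc.getValue() != remoteCrc.getValue()) {
            throw new IOException("End-to-end checksum mismatch for " + remotePath);
        }
    }
}
{code}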

> Improve the availability of Flink when the RocksDB file is corrupted.
> ---------------------------------------------------------------------
>
>                 Key: FLINK-27681
>                 URL: https://issues.apache.org/jira/browse/FLINK-27681
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Ming Li
>            Assignee: Yue Ma
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: image-2023-08-23-15-06-16-717.png
>
>
> We have encountered several cases where the RocksDB checksum does not match or 
> the block verification fails when the job is restored. The cause is usually a 
> problem with the machine where the task runs, which makes the files uploaded 
> to HDFS incorrect, but by the time we notice the problem a long time (a dozen 
> minutes to half an hour) has already passed. I'm not sure if anyone else has 
> had a similar problem.
> Since such a file is referenced by incremental checkpoints for a long time, 
> once the maximum number of retained checkpoints is exceeded, we are stuck with 
> this file until it is no longer referenced. When the job fails, it cannot be 
> recovered.
> Therefore we consider:
> 1. Can RocksDB periodically check whether all files are correct and find the 
> problem in time?
> 2. Can Flink automatically roll back to the previous checkpoint when there is 
> a problem with the checkpoint data? Even with manual intervention, one can 
> only try to recover from an existing checkpoint or discard the entire state.
> 3. Can we cap the number of references to a file based on the maximum number 
> of retained checkpoints? When the number of references exceeds the maximum 
> number of checkpoints - 1, the Task side would be required to upload a new 
> file for this reference. We are not sure whether this guarantees that the 
> newly uploaded file is correct.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
