[jira] [Commented] (FLINK-37319) Add retry in RocksDBStateUploader for fault tolerant

Zakelly Lan (Jira) Thu, 13 Feb 2025 19:12:04 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-37319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17927030#comment-17927030
 ]


Zakelly Lan commented on FLINK-37319:
-------------------------------------

If we encounter occasional errors when uploading to DFS, the checkpoint will 
fail. However, if the DFS recovers before the next checkpoint is triggered, the 
next one will succeed. The checkpoint interval is typically measured in 
minutes (3 minutes by default), whereas the retry interval we are discussing 
here is measured in seconds. A retry interval that is too long could cause the 
checkpoint to exceed its timeout (10 minutes by default). So before deciding 
whether this improvement is useful, it is important to consider whether the 
DFS or object store can actually recover within seconds.
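
To make the timing trade-off concrete, here is a minimal sketch in Java of a 
retry loop that gives up as soon as the next wait would run past a time budget 
(for example, whatever remains of the checkpoint timeout). The class 
BoundedRetry, the method callWithRetry, and all parameter names are 
hypothetical; this is not Flink API and not how RocksDBStateUploader is 
implemented.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of Flink): retries an action within a time budget. */
final class BoundedRetry {

    private BoundedRetry() {}

    /**
     * Calls {@code action} until it succeeds, the attempts run out, or waiting
     * another {@code retryInterval} would cross the deadline derived from {@code budget}.
     * On failure, the exception from the last attempt is rethrown.
     */
    static <T> T callWithRetry(
            Callable<T> action, int maxAttempts, Duration retryInterval, Duration budget)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        final long deadlineNanos = System.nanoTime() + budget.toNanos();
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry on interruption; let the caller handle cancellation.
                throw e;
            } catch (Exception e) {
                last = e;
            }
            boolean attemptsLeft = attempt < maxAttempts;
            // Retry only if the wait itself still fits in the remaining budget.
            boolean timeLeft = deadlineNanos - System.nanoTime() > retryInterval.toNanos();
            if (!attemptsLeft || !timeLeft) {
                break;
            }
            Thread.sleep(retryInterval.toMillis());
        }
        throw last;
    }
}
```

With a retry interval of a few seconds and a budget of several minutes, a 
short DFS hiccup is absorbed inside the same checkpoint. If the outage lasts 
longer than the budget, the call fails and the checkpoint fails as it does 
today, so the next checkpoint remains the fallback.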

> Add retry in RocksDBStateUploader for fault tolerant
> ----------------------------------------------------
>
>                 Key: FLINK-37319
>                 URL: https://issues.apache.org/jira/browse/FLINK-37319
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Async State Processing
>    Affects Versions: 1.20.0, 1.20.1
>            Reporter: Zhenqiu Huang
>            Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
