[ https://issues.apache.org/jira/browse/FLINK-37319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17927030#comment-17927030 ]
Zakelly Lan commented on FLINK-37319:
-------------------------------------

If we encounter occasional errors when uploading to DFS, the checkpoint fails. However, if the DFS recovers before the next checkpoint is triggered, the next one will succeed. The checkpoint interval is typically in minutes (3 minutes by default), whereas the retry interval we are discussing here is in seconds. A retry interval that is too long could cause the checkpoint to exceed its time limit (10 minutes by default). Whether this improvement is useful therefore depends on whether the DFS or object store can recover within seconds.

> Add retry in RocksDBStateUploader for fault tolerant
> ----------------------------------------------------
>
>                 Key: FLINK-37319
>                 URL: https://issues.apache.org/jira/browse/FLINK-37319
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Async State Processing
>    Affects Versions: 1.20.0, 1.20.1
>            Reporter: Zhenqiu Huang
>            Priority: Minor
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
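
To make the timing concern in the comment concrete, here is a minimal sketch of a retry loop that gives up once the next wait would overrun a time budget. It is only an illustration under stated assumptions: `BoundedRetry`, `callWithRetry`, and all parameter names are hypothetical, and this is neither existing Flink code nor the change proposed for RocksDBStateUploader. The budget stands in for whatever share of the checkpoint time limit the upload is allowed to consume.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of Flink: retries an action within a time budget. */
final class BoundedRetry {

    private BoundedRetry() {}

    /**
     * Calls the action until it succeeds, maxAttempts is reached, or waiting
     * another retryInterval would use up the budget. Only IOException is
     * treated as transient; any other exception propagates immediately.
     */
    static <T> T callWithRetry(
            Callable<T> action, int maxAttempts, Duration retryInterval, Duration budget)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long deadline = System.nanoTime() + budget.toNanos();
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                last = e; // assumed transient, e.g. a short DFS outage
            }
            long remaining = deadline - System.nanoTime();
            // Give up if this was the last attempt or if sleeping would overrun the budget.
            if (attempt == maxAttempts || remaining <= retryInterval.toNanos()) {
                break;
            }
            Thread.sleep(retryInterval.toMillis());
        }
        throw last;
    }
}
```

With a retry interval of a few seconds and a budget well below the checkpoint time limit, a helper like this only pays off if the store actually recovers within the budget; otherwise the checkpoint still fails, just later.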