[ https://issues.apache.org/jira/browse/FLINK-37319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17926984#comment-17926984 ]
Zhenqiu Huang commented on FLINK-37319:
---------------------------------------

For Flink applications that use cloud storage as the state backend, checkpoints can fail because of object store DR (sometimes quite often, with a 90-second timeout). Thus, it would be great to add an additional layer of retry in the RocksDBStateUploader to handle these transient failures.

> Add retry in RocksDBStateUploader for fault tolerant
> ----------------------------------------------------
>
>                 Key: FLINK-37319
>                 URL: https://issues.apache.org/jira/browse/FLINK-37319
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Zhenqiu Huang
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
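
For illustration only, the kind of retry layer proposed in the comment could look roughly like the sketch below. It is not the actual RocksDBStateUploader code and does not use any Flink API: the class name RetrySketch, the method callWithRetry, and the parameters maxAttempts and initialBackoffMillis are all hypothetical, and treating IOException as the "transient" failure type is an assumption.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Flink: retries an action that may fail
// with a transient IOException, using exponential backoff between attempts.
final class RetrySketch {

    private RetrySketch() {}

    static <T> T callWithRetry(Callable<T> action, int maxAttempts, long initialBackoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long backoffMillis = initialBackoffMillis;
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Success on any attempt returns immediately.
                return action.call();
            } catch (IOException e) {
                // Assumption: IOException marks a transient storage failure.
                // Any other exception type propagates without a retry.
                lastFailure = e;
                if (attempt == maxAttempts) {
                    break;
                }
                Thread.sleep(backoffMillis);
                backoffMillis *= 2; // double the wait before the next attempt
            }
        }
        // All attempts failed: surface the last transient failure.
        throw lastFailure;
    }
}
```

A caller would wrap the upload of a single state file in the Callable, so that only the failed file is retried rather than the whole checkpoint. Whether the attempt count and backoff should be configurable is left open here.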