[ https://issues.apache.org/jira/browse/FLINK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flink Jira Bot updated FLINK-3431:
----------------------------------
    Labels: auto-deprioritized-critical auto-unassigned stale-major  (was: auto-deprioritized-critical auto-unassigned)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issue has been marked as Major but is unassigned, and neither it nor its Sub-Tasks have been updated for 30 days. I have gone ahead and added a "stale-major" label to the issue. If this ticket is a Major, please either assign yourself or give an update. Afterwards, please remove the label, or in 7 days the issue will be deprioritized.

> Add retrying logic for RocksDB snapshots
> ----------------------------------------
>
>                 Key: FLINK-3431
>                 URL: https://issues.apache.org/jira/browse/FLINK-3431
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Gyula Fora
>            Priority: Major
>              Labels: auto-deprioritized-critical, auto-unassigned, stale-major
>
> Currently the RocksDB snapshots rely on the HDFS copy not failing while
> the snapshot is taken.
> In some cases, when the state size is big enough, the HDFS nodes might get so
> overloaded that the copy operation fails with errors like this:
> AsynchronousException{java.io.IOException: All datanodes 172.26.86.90:50010 are bad. Aborting...}
>       at org.apache.flink.streaming.runtime.tasks.StreamTask$1.run(StreamTask.java:545)
> Caused by: java.io.IOException: All datanodes 172.26.86.90:50010 are bad. Aborting...
>       at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1023)
>       at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:838)
>       at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:483)
> I think it is important that we don't immediately fail the job in these
> cases but retry the copy operation after some random sleep time. It might
> also be good to do a random sleep before the copy, depending on the state size,
> to smooth out IO a little bit.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
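
The retry proposed in the issue description (retry the copy after a random sleep instead of failing the job right away) could be sketched roughly as follows. This is only an illustration, not Flink code: the class `RetryingCopy`, the method `retryWithRandomSleep`, and the parameters `maxAttempts` and `maxSleepMillis` are hypothetical names chosen for this example, and the copy operation is passed in as a plain `Callable`.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper for illustration; not part of Flink or Hadoop. */
final class RetryingCopy {

    private RetryingCopy() {}

    /**
     * Runs the given copy operation up to maxAttempts times. After each
     * IOException (except the last one) it sleeps for a random duration in
     * [0, maxSleepMillis] before trying again. Other exceptions are not retried.
     */
    static <T> T retryWithRandomSleep(Callable<T> copy, int maxAttempts, long maxSleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        if (maxSleepMillis < 0 || maxSleepMillis == Long.MAX_VALUE) {
            throw new IllegalArgumentException("maxSleepMillis out of range");
        }
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return copy.call();
            } catch (IOException e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    // Random sleep so that parallel tasks do not all retry at once.
                    long sleepMillis =
                            ThreadLocalRandom.current().nextLong(maxSleepMillis + 1);
                    Thread.sleep(sleepMillis);
                }
            }
        }
        // All attempts failed: surface the last I/O error to the caller.
        throw lastFailure;
    }
}
```

A caller would wrap the snapshot copy in the `Callable`, for example `RetryingCopy.retryWithRandomSleep(() -> copyToHdfs(src, dst), 3, 5_000L)`, where `copyToHdfs` stands in for whatever performs the actual copy. The random sleep before the first copy, scaled by state size, is left out of this sketch.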