[ https://issues.apache.org/jira/browse/FLINK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flink Jira Bot updated FLINK-3431: ---------------------------------- Labels: auto-deprioritized-critical auto-unassigned (was: auto-unassigned stale-critical) Priority: Major (was: Critical) This issue was labeled "stale-critical" 7 ago and has not received any updates so it is being deprioritized. If this ticket is actually Critical, please raise the priority and ask a committer to assign you the issue or revive the public discussion. > Add retrying logic for RocksDB snapshots > ---------------------------------------- > > Key: FLINK-3431 > URL: https://issues.apache.org/jira/browse/FLINK-3431 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends > Reporter: Gyula Fora > Priority: Major > Labels: auto-deprioritized-critical, auto-unassigned > > Currently the RocksDB snapshots rely on hdfs copy not failing while taking > the snapshots. > In some cases when the state size is big enough the HDFS nodes might get so > overloaded that the copy operation fails on errors like this: > AsynchronousException{java.io.IOException: All datanodes 172.26.86.90:50010 > are bad. Aborting...} > at > org.apache.flink.streaming.runtime.tasks.StreamTask$1.run(StreamTask.java:545) > Caused by: java.io.IOException: All datanodes 172.26.86.90:50010 are bad. > Aborting... > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1023) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:838) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:483) > I think it would be important that we don't immediately fail the job in these > cases but retry the copy operation after some random sleep time. It might be > also good to do a random sleep before the copy depending on the state size to > smoothen out IO a little bit. -- This message was sent by Atlassian Jira (v8.3.4#803005)