[ https://issues.apache.org/jira/browse/FLINK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aljoscha Krettek updated FLINK-3431:
------------------------------------
    Component/s:     (was: API / DataStream)

> Add retrying logic for RocksDB snapshots
> ----------------------------------------
>
>                 Key: FLINK-3431
>                 URL: https://issues.apache.org/jira/browse/FLINK-3431
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Gyula Fora
>            Assignee: Shuyi Chen
>            Priority: Critical
>
> Currently the RocksDB snapshots rely on the HDFS copy not failing while the snapshot is being taken.
> In some cases, when the state size is big enough, the HDFS nodes can become so overloaded that the copy operation fails with errors like this:
> AsynchronousException{java.io.IOException: All datanodes 172.26.86.90:50010 are bad. Aborting...}
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask$1.run(StreamTask.java:545)
> Caused by: java.io.IOException: All datanodes 172.26.86.90:50010 are bad. Aborting...
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1023)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:838)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:483)
> I think it is important that we don't immediately fail the job in these cases but instead retry the copy operation after some random sleep time. It might also be good to do a random sleep before the copy, depending on the state size, to smooth out IO a little bit.
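> For illustration, a minimal sketch of the retry-with-random-sleep idea described above; this is not Flink's actual snapshot code, and the class, method, and parameter names (copyWithRetry, maxRetries, initialWindowMs) are hypothetical:
{code:java}
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical sketch of a retry-with-randomized-backoff wrapper for a
 * snapshot copy operation. All names are illustrative, not Flink APIs.
 */
public final class RetryingSnapshotCopy {

    /** A copy action that may fail transiently, e.g. an HDFS write. */
    @FunctionalInterface
    public interface CopyAction {
        void run() throws IOException;
    }

    /**
     * Runs the copy, retrying up to maxRetries times. Every attempt is
     * preceded by a random sleep so that many subtasks snapshotting at
     * once do not all hit HDFS at the same instant; the sleep window
     * doubles after each failure.
     */
    public static void copyWithRetry(CopyAction copy, int maxRetries, long initialWindowMs)
            throws IOException, InterruptedException {
        long windowMs = initialWindowMs;
        IOException lastFailure = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            // Random pre-copy sleep smooths out the IO spike.
            Thread.sleep(ThreadLocalRandom.current().nextLong(windowMs + 1));
            try {
                copy.run();
                return; // copy succeeded
            } catch (IOException e) {
                lastFailure = e; // remember the error, retry with a longer window
                windowMs *= 2;
            }
        }
        throw lastFailure; // all attempts failed: surface the last error as before
    }
}
{code}
> Scaling the initial sleep window by the state size, rather than using the fixed initialWindowMs above, would give the size-dependent pre-copy smoothing suggested in the last paragraph.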