[ https://issues.apache.org/jira/browse/FLINK-21846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flink Jira Bot updated FLINK-21846:
-----------------------------------
      Labels: auto-deprioritized-major reactive  (was: reactive stale-major)
    Priority: Minor  (was: Major)

This issue was labeled "stale-major" 7 days ago and has not received any updates, so it is being deprioritized. If this ticket is actually Major, please raise the priority and ask a committer to assign you the issue or revive the public discussion.

> Rethink whether failure of ExecutionGraph creation in Adaptive Scheduler should directly fail the job
> -----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-21846
>                 URL: https://issues.apache.org/jira/browse/FLINK-21846
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.0
>            Reporter: Till Rohrmann
>            Priority: Minor
>              Labels: auto-deprioritized-major, reactive
>
> Currently, the {{AdaptiveScheduler}} fails a job execution if the
> {{ExecutionGraph}} creation fails. This can be problematic because the
> failure could result from a transient problem (e.g. the filesystem is
> currently not available). In the case of a transient problem, a job rescaling
> could lead to a job failure, which might be surprising for users. Instead, I
> would expect Flink to retry the {{ExecutionGraph}} creation.
> One idea could be to ask the restart policy how to treat the failure and
> whether to retry the {{ExecutionGraph}} creation or not.
> One thing to keep in mind, though, is that some failures might be permanent
> (e.g. a wrongly specified savepoint path). In such a case we would
> ideally fail immediately. One way to address this problem could be to try to
> restore the savepoint once we create the {{AdaptiveScheduler}}.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
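
The idea in the quoted description (ask a restart policy whether to retry the {{ExecutionGraph}} creation, but fail immediately on permanent failures) could look roughly like the sketch below. This is only an illustration under stated assumptions: `RetryPolicy`, `PermanentFailureException`, `fixedAttempts` and `createWithRetry` are hypothetical names invented for this example and are not Flink classes or the actual {{AdaptiveScheduler}} implementation.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical sketch; none of these types are Flink classes. */
public class GraphCreationRetrySketch {

    /** Hypothetical marker for failures that retrying cannot fix, e.g. a wrong savepoint path. */
    static class PermanentFailureException extends Exception {
        PermanentFailureException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for a restart policy: decides whether another attempt is allowed. */
    interface RetryPolicy {
        boolean canRetry(int failedAttempts, Exception cause);
    }

    /** Allows up to maxAttempts attempts in total, but never retries permanent failures. */
    static RetryPolicy fixedAttempts(int maxAttempts) {
        return (failedAttempts, cause) ->
                !(cause instanceof PermanentFailureException) && failedAttempts < maxAttempts;
    }

    /** Runs the creation step and consults the policy after each failure. */
    static <T> T createWithRetry(Callable<T> creation, RetryPolicy policy) throws Exception {
        int failedAttempts = 0;
        while (true) {
            try {
                return creation.call();
            } catch (Exception e) {
                failedAttempts++;
                if (!policy.canRetry(failedAttempts, e)) {
                    throw e; // permanent failure or attempts exhausted: fail the job
                }
                // A real scheduler would wait (backoff) before the next attempt.
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated transient problem: the first two attempts fail, the third succeeds.
        String graph = createWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("filesystem currently not available");
            }
            return "execution-graph";
        }, fixedAttempts(5));
        System.out.println(graph + " created after " + calls[0] + " attempts");
    }
}
```

With this split, a transient `IOException` is retried until the policy gives up, while a `PermanentFailureException` is rethrown on the first attempt, which matches the "fail immediately" behaviour the description asks for.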