Till Rohrmann created FLINK-21846:
-------------------------------------
Summary: Rethink whether failure of ExecutionGraph creation should
directly fail the job
Key: FLINK-21846
URL: https://issues.apache.org/jira/browse/FLINK-21846
Project: Flink
Issue Type: Sub-task
Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Till Rohrmann
Fix For: 1.13.0
Currently, the {{AdaptiveScheduler}} fails a job execution if the
{{ExecutionGraph}} creation fails. This can be problematic because the failure
could result from a transient problem (e.g. the filesystem is temporarily
unavailable). In that case, a job rescaling could fail the whole job, which
would be surprising for users. Instead, I would expect Flink to retry the
{{ExecutionGraph}} creation.
One idea could be to ask the restart policy how to treat the failure, i.e.
whether to retry the {{ExecutionGraph}} creation or not.
One thing to keep in mind, though, is that some failures might be permanent
(e.g. a wrongly specified savepoint path). In such a case we would ideally
fail immediately.
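
A rough sketch of how the two ideas could fit together, only to make the
proposal concrete. Nothing here is Flink's actual API: {{RestartPolicy}},
{{PermanentFailureException}}, {{decide}} and {{createWithRetry}} are
hypothetical names, and a real implementation would presumably also wait for
the backoff time suggested by the restart policy before retrying.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

final class GraphCreationRetry {

    /** Hypothetical marker for failures that a retry cannot fix (e.g. a wrong savepoint path). */
    static class PermanentFailureException extends Exception {
        PermanentFailureException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for the restart policy; not Flink's real interface. */
    interface RestartPolicy {
        boolean canRestart(int attemptsSoFar);
    }

    enum Decision { RETRY, FAIL }

    /** Permanent failures fail immediately; everything else is left to the policy. */
    static Decision decide(Throwable failure, int attemptsSoFar, RestartPolicy policy) {
        if (failure instanceof PermanentFailureException) {
            return Decision.FAIL;
        }
        return policy.canRestart(attemptsSoFar) ? Decision.RETRY : Decision.FAIL;
    }

    /** Runs the creation step until it succeeds or decide() says to give up. */
    static <T> T createWithRetry(Callable<T> creation, RestartPolicy policy) throws Exception {
        int attempts = 0;
        while (true) {
            try {
                return creation.call();
            } catch (Exception e) {
                attempts++;
                if (decide(e, attempts, policy) == Decision.FAIL) {
                    throw e; // permanent failure or retries exhausted
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        RestartPolicy upToThree = attempts -> attempts < 3;
        int[] calls = {0};
        // Simulated creation step: fails twice with a transient error, then succeeds.
        String graph = createWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("filesystem temporarily unavailable");
            }
            return "execution-graph";
        }, upToThree);
        System.out.println(graph + " created after " + calls[0] + " attempts");
    }
}
```

The open question in this sketch is how a failure gets classified as
permanent. Here the code that detects the problem is assumed to throw a
dedicated exception type; other classification schemes are conceivable.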