If a TaskManager fails, the intermediate data stored on it is lost and needs to be recomputed. So even with batch mode configured, more tasks than just the failed one might need to restart. To mitigate that, the Flink developers would need to implement support for external shuffle services.
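For reference, here is a minimal sketch of a DataSet job with batch execution mode enabled, as suggested below. It is only an illustration: the class name, input/output paths, and the map function are placeholders, not taken from this thread.

import org.apache.flink.api.common.ExecutionMode;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BatchModeEtlSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // BATCH mode makes shuffles blocking, so intermediate results are
        // materialized and a failed region can be restarted from them instead
        // of restarting all connected subtasks.
        env.getConfig().setExecutionMode(ExecutionMode.BATCH);

        // Placeholder "embarrassingly parallel" ETL: read, transform, write.
        DataSet<String> input = env.readTextFile("s3://example-bucket/input");
        input.map(new MapFunction<String, String>() {
                 @Override
                 public String map(String value) {
                     return value.toUpperCase();
                 }
             })
             .writeAsText("s3://example-bucket/output");

        env.execute("batch-mode ETL sketch");
    }
}

Note that the materialized results live on the TaskManagers, which is exactly why a TaskManager failure still forces recomputation, as described above.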
On Wed, Dec 16, 2020 at 9:10 AM Robert Metzger <rmetz...@apache.org> wrote:

> With region failover strategy, all connected subtasks will fail.
>
> If you are using the DataSet API with
> env.getConfig().setExecutionMode(ExecutionMode.BATCH);, you should get
> the desired behavior.
>
> On Mon, Dec 14, 2020 at 5:24 PM Stanislav Borissov <sk.boris...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'm running a simple, "embarrassingly parallel" ETL-type job. I noticed
>> that a failure in one subtask causes the entire job to restart. Even with
>> the region failover strategy, all subtasks of this task and connected
>> ones would fail. Is there any way to limit restarting to only the single
>> subtask that failed, so all other subtasks can stay alive and keep
>> working?
>>
>> For context, I use Flink 1.11 in AWS Kinesis Data Analytics, so some
>> configuration is not controlled by me
>> <https://docs.aws.amazon.com/kinesisanalytics/latest/java/reference-flink-settings.title.html>.
>>
>> Thanks