Paris Carbone created FLINK-3256:
------------------------------------

             Summary: Invalid execution graph cleanup for jobs with colocation groups
                 Key: FLINK-3256
                 URL: https://issues.apache.org/jira/browse/FLINK-3256
             Project: Flink
          Issue Type: Bug
          Components: Distributed Runtime
            Reporter: Paris Carbone
            Assignee: Paris Carbone
            Priority: Blocker

Currently, upon restarting an execution graph, we clean up the colocation constraints of each group separately for every ExecutionJobVertex in which the group is present. When vertices share a colocation group, this can lead to invalid reconfiguration upon a restart, or upon any other activity that relies on state cleanup of the execution graph.

For example, upon restarting a DataStream job with iterations, the following steps are executed:

1) IterationSource colocation group constraints are reset
2) New IterationSource colocation group constraints are generated
3) IterationSource subtasks are scheduled with the current colocation constraints
4) IterationSink colocation group constraints are reset
5) New IterationSink colocation group constraints are generated
6) IterationSink subtasks are scheduled with different colocation constraints, so they are not colocated with the sources and also demand more slots from the scheduler.

This can be trivially fixed by resetting colocation groups independently of ExecutionJobVertices, thus updating them once per reconfiguration.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
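
The restart sequence described in this report can be reproduced with a minimal Java sketch. Every class and method name below (Group, Vertex, restartPerVertex, restartOncePerGroup) is hypothetical and chosen for illustration only; none of it is Flink's actual runtime code. It assumes the iteration source and sink share one colocation group, and that two subtasks are colocated exactly when they hold the same constraint instance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the issue; NOT Flink's actual classes.
class ColocationResetSketch {

    /** Marker object: subtasks holding the same instance count as colocated. */
    static final class Constraint {}

    /** A colocation group that may be shared by several vertices. */
    static final class Group {
        final int parallelism;
        List<Constraint> constraints;

        Group(int parallelism) {
            this.parallelism = parallelism;
            reset();
        }

        /** Discards the old constraints and generates fresh ones. */
        void reset() {
            constraints = new ArrayList<>();
            for (int i = 0; i < parallelism; i++) {
                constraints.add(new Constraint());
            }
        }
    }

    /** Stand-in for a job vertex; records the constraints used at scheduling time. */
    static final class Vertex {
        final String name;
        final Group group;
        final List<Constraint> scheduledWith = new ArrayList<>();

        Vertex(String name, Group group) {
            this.name = name;
            this.group = group;
        }

        void schedule() {
            scheduledWith.clear();
            scheduledWith.addAll(group.constraints);
        }
    }

    /** Order from the report: each vertex resets the shared group before scheduling. */
    static void restartPerVertex(List<Vertex> vertices) {
        for (Vertex v : vertices) {
            v.group.reset();
            v.schedule();
        }
    }

    /** Proposed order: reset every distinct group once, then schedule all vertices. */
    static void restartOncePerGroup(List<Vertex> vertices) {
        // Group does not override equals/hashCode, so this set is identity-based.
        Set<Group> groups = new LinkedHashSet<>();
        for (Vertex v : vertices) {
            groups.add(v.group);
        }
        for (Group g : groups) {
            g.reset();
        }
        for (Vertex v : vertices) {
            v.schedule();
        }
    }

    /** True if subtask i of both vertices was scheduled with the same constraint. */
    static boolean colocated(Vertex a, Vertex b) {
        for (int i = 0; i < a.scheduledWith.size(); i++) {
            if (a.scheduledWith.get(i) != b.scheduledWith.get(i)) {
                return false;
            }
        }
        return true;
    }

    static boolean demo(boolean resetOncePerGroup) {
        Group shared = new Group(2);
        Vertex source = new Vertex("IterationSource", shared);
        Vertex sink = new Vertex("IterationSink", shared);
        List<Vertex> vertices = Arrays.asList(source, sink);
        if (resetOncePerGroup) {
            restartOncePerGroup(vertices);
        } else {
            restartPerVertex(vertices);
        }
        return colocated(source, sink);
    }

    public static void main(String[] args) {
        System.out.println("per-vertex reset colocated: " + demo(false));
        System.out.println("once-per-group reset colocated: " + demo(true));
    }
}
```

In this model the per-vertex order leaves the sink holding constraints that the source never saw, which corresponds to step 6 above; resetting each group once before scheduling keeps both vertices on the same constraints.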