Andrey Zagrebin created FLINK-19142:
---------------------------------------

             Summary: Investigate slot hijacking from preceding pipelined 
regions after failover
                 Key: FLINK-19142
                 URL: https://issues.apache.org/jira/browse/FLINK-19142
             Project: Flink
          Issue Type: Improvement
            Reporter: Andrey Zagrebin


The ticket originates from [this PR 
discussion|https://github.com/apache/flink/pull/13181#discussion_r481087221].

PreviousAllocationSlotSelectionStrategy uses the previous AllocationIDs to 
schedule subtasks into the slots where they ran before a failover. If a 
subtask's previous slot (AllocationID) is not available, we do not want it to 
take the previous slot (AllocationID) of another subtask.

The MergingSharedSlotProfileRetriever gets the previous AllocationIDs from 
SlotSharingExecutionSlotAllocator, but only those of the current bulk. The 
previous AllocationIDs of other bulks stay unknown. Therefore, the current bulk 
can potentially hijack the previous slots of the preceding bulks. On the 
other hand, the previous AllocationIDs of other tasks should be free to take 
if those tasks are not going to run at the same time, e.g. because there are 
not enough resources after the failover or because the other bulks are done.

One way to do this may be to give MergingSharedSlotProfileRetriever all 
previous AllocationIDs of the bulks that are going to run at the same time.
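
A minimal sketch of the idea, not Flink code: every class and method name 
below (SlotHijackSketch, reservedAllocations, selectSlot) is hypothetical, 
and AllocationIDs and bulks are modelled as plain strings. It assumes a 
fallback that skips any free slot which a concurrently running subtask 
previously occupied, and shows how the result changes once the preceding 
bulk's previous AllocationIDs are also known.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: none of these names are Flink classes or methods.
final class SlotHijackSketch {

    // Union of the previous allocation IDs of all bulks expected to run concurrently.
    static Set<String> reservedAllocations(
            Map<String, Set<String>> previousAllocationsByBulk, Set<String> concurrentBulks) {
        Set<String> reserved = new LinkedHashSet<>();
        for (String bulk : concurrentBulks) {
            reserved.addAll(previousAllocationsByBulk.getOrDefault(bulk, Set.of()));
        }
        return reserved;
    }

    // Prefer the subtask's own previous slot; otherwise take the first free slot
    // that no concurrently running subtask has a previous claim on.
    static Optional<String> selectSlot(
            String previousAllocation, List<String> freeSlots, Set<String> reserved) {
        if (previousAllocation != null && freeSlots.contains(previousAllocation)) {
            return Optional.of(previousAllocation);
        }
        return freeSlots.stream().filter(slot -> !reserved.contains(slot)).findFirst();
    }

    public static void main(String[] args) {
        Map<String, Set<String>> previous = new LinkedHashMap<>();
        previous.put("bulkA", Set.of("s1")); // preceding bulk, previously ran in s1
        previous.put("bulkB", Set.of("s2")); // current bulk, previously ran in s2 (now lost)
        List<String> freeSlots = List.of("s1", "s3");

        // Only the current bulk is known: s1 looks unclaimed and gets hijacked.
        Set<String> currentOnly = reservedAllocations(previous, Set.of("bulkB"));
        System.out.println(selectSlot("s2", freeSlots, currentOnly)); // Optional[s1]

        // Both bulks are known: s1 stays reserved for bulkA, so s3 is chosen.
        Set<String> both = reservedAllocations(previous, Set.of("bulkA", "bulkB"));
        System.out.println(selectSlot("s2", freeSlots, both)); // Optional[s3]
    }
}
```

Passing only the bulks that will actually run concurrently keeps the other 
case open: if bulkA is finished or cannot be scheduled, leaving it out of 
the set makes s1 available to bulkB again.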



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
