Hi Flink Community, I'm currently running a heavy flink job on Flink 1.9.3 that has a lot of subtasks and observing some subtask distribution issues. The job in question has 9288 sub tasks and they are running on a large set of TMs (total available slots are 1792).
I'm using the *cluster.evenly-spread-out-slots* configuration option to have the slots be allocated uniformly but I am still seeing non-uniform subtask distribution that seems to be affecting performance. It looks like some TMs are being overloaded and seeing a much greater than uniform allocation of subtasks. I've been trying to reproduce this situation at a smaller scale but have not been successful in doing so. As part of debugging the scheduling process when trying to reproduce this at a smaller scale I observed that the non-location preference selectWithoutLocationPreference override introduced by the evenly spread out strategy option is not being invoked at all as the execution vertices still have a location preference to be assigned the same slots as their input vertices. This happens at job startup time and not during recovery, so I'm not sure if recovery is where the non preference code path is invoked. In essence the only impact of using the evenly spread out strategy seems to be a slightly different candidate score calculation. I wanted to know:- 1. Is the evenly spread out strategy the right option to choose for achieving the uniform distribution of subtasks? 2. Is the observed scheduling behaviour expected for the evenly spread out strategy? When do we expect the non location preference code path to be invoked? For us this only happens on sources since they have no incoming edges. Apart from this I am still trying to understand the nature of scheduling in Flink and how that could bring about this situation, I was wondering if there were known issues or peculiarities of the Flink job scheduler that could lead to this situation occurring. For example I'm looking at the known issues mentioned in the ticket https://issues.apache.org/jira/browse/FLINK-11815 . I was hoping to understand :- 1. The conditions that would give rise to these kinds of situations or how to verify if we are running into them. For example, how to verify that key group allocation is non-uniform 2. If these issues have been addressed in subsequent versions of flink 3. If there is any other information about the nature of scheduling jobs in flink that could give rise to the non-uniform distribution observed. Please let me know if further information needs to be provided. Thanks, Harshit