Thanks Yuepeng and Rui for creating this FLIP.

+1 in general
The idea is straightforward: on a best-effort basis, gather all the slot
requests and offered slots to form an overview before assigning slots, and
then try to balance the load across task managers during assignment.
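
To make sure I read the idea correctly, here is a minimal sketch of "gather
first, then assign with balance in mind". It is only my illustration, not the
algorithm in the FLIP: the function name, the dict-based inputs, and the greedy
heaviest-first rule are all assumptions.

```python
import heapq


def assign_balanced(request_loads, tm_free_slots):
    """Greedy sketch (hypothetical, not the FLIP's algorithm).

    request_loads: {request_id: number of tasks the slot will run}
    tm_free_slots: {task_manager_id: number of free slots offered}
    Returns {request_id: task_manager_id}.
    """
    # Heap entries are (tasks already assigned, tm id, remaining free slots),
    # so the least-loaded task manager is always popped first.
    heap = [(0, tm, free) for tm, free in sorted(tm_free_slots.items()) if free > 0]
    heapq.heapify(heap)

    assignment = {}
    # Place the heaviest requests first; ties are broken by request id.
    for req, load in sorted(request_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        if not heap:
            raise ValueError("not enough free slots for all requests")
        tm_load, tm, free = heapq.heappop(heap)
        assignment[req] = tm
        # Keep the task manager in the heap only while it has slots left.
        if free > 1:
            heapq.heappush(heap, (tm_load + load, tm, free - 1))
    return assignment


if __name__ == "__main__":
    print(assign_balanced({"r1": 4, "r2": 3, "r3": 2, "r4": 1}, {"tm1": 2, "tm2": 2}))
```

The point is only that having the whole picture of requests and slots makes a
balanced assignment possible; assigning each request as it arrives cannot do
this.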

I have one comment regarding the configuration for ease of use:

IIUC, this FLIP uses the existing config 'cluster.evenly-spread-out-slots'
as the main switch of the new feature. That is, from the user's perspective,
with this improvement the 'cluster.evenly-spread-out-slots' feature not
only balances the number of slots on task managers, but also balances the
number of tasks. This is a behavior change anyway. Besides that, it also
requires users to set 'slot.sharing-strategy' to 'TASK_BALANCED_PREFERRED'
to balance the tasks in each slot.

I think we can introduce a new config option `taskmanager.load-balance.mode`,
which accepts "None"/"Slots"/"Tasks". `cluster.evenly-spread-out-slots`
can be superseded by the "Slots" mode and deprecated. In the future
it can support more modes, e.g. "CpuCores", to work better for jobs with
fine-grained resources. The proposed config option `slot.request.max-interval`
can then be renamed to `taskmanager.load-balance.request-stablizing-timeout`
to show its relation to the feature. The proposed `slot.sharing-strategy`
is not needed, because the configured "Tasks" mode will do the work.
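
A rough sketch of how the single option could be resolved, just to illustrate
the suggestion. The enum, the function name, and in particular the fallback
from the legacy key to the "Slots" mode are my assumptions, not something the
FLIP specifies; real Flink code would of course be Java.

```python
from enum import Enum


class LoadBalanceMode(Enum):
    # Values taken from the suggestion above; "CpuCores" could be added later.
    NONE = "None"
    SLOTS = "Slots"
    TASKS = "Tasks"


NEW_KEY = "taskmanager.load-balance.mode"
LEGACY_KEY = "cluster.evenly-spread-out-slots"


def resolve_mode(conf):
    """Resolve the load-balance mode from a plain dict of config entries.

    Assumption: the new key wins when present; otherwise the legacy
    boolean maps to the "Slots" mode, so existing setups keep working.
    """
    if NEW_KEY in conf:
        wanted = str(conf[NEW_KEY]).strip().lower()
        for mode in LoadBalanceMode:
            if mode.value.lower() == wanted:
                return mode
        raise ValueError(f"unknown value for {NEW_KEY}: {conf[NEW_KEY]!r}")
    if str(conf.get(LEGACY_KEY, "false")).strip().lower() == "true":
        return LoadBalanceMode.SLOTS
    return LoadBalanceMode.NONE


if __name__ == "__main__":
    print(resolve_mode({"cluster.evenly-spread-out-slots": "true"}))
    print(resolve_mode({"taskmanager.load-balance.mode": "Tasks"}))
```

With this shape, users touch one option to pick the balancing behavior, and
the old flag keeps its current meaning until it is removed.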

WDYT?

Thanks,
Zhu Zhu

Yuepeng Pan <panyuep...@apache.org> wrote on Mon, Sep 25, 2023 at 16:26:

> Hi all,
>
>
> Fan Rui (CC'ed) and I created FLIP-370[1] to support balanced task
> scheduling.
>
>
> Flink's current task deployment strategy sometimes leaves some
> TMs (TaskManagers) with more tasks than others. The TMs that hold more
> tasks see excessive resource utilization and become a bottleneck for
> the entire job. Developing strategies to balance the task load across
> TMs and reduce such bottlenecks is therefore very meaningful.
>
>
> The initial design and discussions can be found in the Flink JIRA[2] and
> Google doc[3]. We really appreciate Zhu Zhu (CC'ed) for providing
> valuable help and suggestions in advance.
>
>
> Please refer to the FLIP[1] document for more details about the proposed
> design and implementation. We welcome any feedback and opinions on this
> proposal.
>
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-370%3A+Support+Balanced+Tasks+Scheduling
>
> [2] https://issues.apache.org/jira/browse/FLINK-31757
>
> [3]
> https://docs.google.com/document/d/14WhrSNGBdcsRl3IK7CZO-RaZ5KXU2X1dWqxPEFr3iS8
>
>
> Best,
>
> Yuepeng Pan
>
