[ https://issues.apache.org/jira/browse/FLINK-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658929#comment-16658929 ]
Tony Xintong Song commented on FLINK-10640:
-------------------------------------------

Hi [~Tison], thanks for the comment, and thanks for pinging Till. This is indeed a big proposal, so I'd like to open it up for wide discussion before we start the implementation.

I think the fundamental difference between the current slot-based management and resource-based management is that the latter considers both the number and the size of slots, while the former considers only the number. Although the slot sharing group mechanism helps improve resource utilization without knowing the precise resource needs of individual tasks, it has limitations, especially when the parallelism of tasks differs significantly. We believe that resource-based management will allow Flink to apply more flexible resource management strategies and algorithms, of which sharing-group-based slot allocation may be only one.

As for [Flink-10407|https://issues.apache.org/jira/browse/FLINK-10407], we did notice that people are working on reactive mode and auto-scaling. As far as I can see, this issue does not run counter to the issues you mentioned. We are closely following their progress and carefully revising our design to make sure it supports reactive mode and auto-scaling.

Elastic session is just a preliminary idea. It is like the session mode, except that the number of TMs is not fixed. We do not have a clear plan for this yet. My point here is that we should allow pluggable ("portable") strategies to make the strategy-related decisions: the size and number of TMs, and when to create and release them.
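
To make the idea of sized slots plus a pluggable strategy more concrete, here is a minimal Java sketch. It is only an illustration under stated assumptions: every name in it (`SlotAllocationSketch`, `Profile`, `TaskManager`, `AllocationStrategy`, `FirstFit`, `allocate`) is hypothetical and is not part of Flink's API, and a profile is reduced to CPU cores and memory for brevity. A TM starts with a total amount of resources and no slots, and a slot of exactly the requested size is created when a request is placed on it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these types are Flink classes. */
final class SlotAllocationSketch {

    /** Size of a slot or of a slot request (illustrative fields only). */
    static final class Profile {
        final double cpuCores;
        final int memoryMb;

        Profile(double cpuCores, int memoryMb) {
            this.cpuCores = cpuCores;
            this.memoryMb = memoryMb;
        }

        /** True if this profile can be satisfied by the given available resources. */
        boolean fitsIn(Profile available) {
            return cpuCores <= available.cpuCores && memoryMb <= available.memoryMb;
        }

        /** Resources left after taking {@code used} out of this profile. */
        Profile minus(Profile used) {
            return new Profile(cpuCores - used.cpuCores, memoryMb - used.memoryMb);
        }
    }

    /** A TaskManager that starts with total resources and no slots. */
    static final class TaskManager {
        final String id;
        Profile free;
        final List<Profile> slots = new ArrayList<>();

        TaskManager(String id, Profile total) {
            this.id = id;
            this.free = total;
        }

        /** Creates a slot of exactly the requested size if enough resources are free. */
        boolean tryCreateSlot(Profile request) {
            if (!request.fitsIn(free)) {
                return false;
            }
            free = free.minus(request);
            slots.add(request);
            return true;
        }
    }

    /** Pluggable decision point: which TM should host a given request. */
    interface AllocationStrategy {
        /** Returns the chosen TaskManager, or null if no TM can host the request. */
        TaskManager select(Profile request, List<TaskManager> taskManagers);
    }

    /** One possible strategy: the first TM with enough free resources. */
    static final class FirstFit implements AllocationStrategy {
        @Override
        public TaskManager select(Profile request, List<TaskManager> taskManagers) {
            for (TaskManager tm : taskManagers) {
                if (request.fitsIn(tm.free)) {
                    return tm;
                }
            }
            return null;
        }
    }

    /** Asks the strategy for a TM and creates the slot there; null if nothing fits. */
    static TaskManager allocate(Profile request, List<TaskManager> tms, AllocationStrategy strategy) {
        TaskManager tm = strategy.select(request, tms);
        return (tm != null && tm.tryCreateSlot(request)) ? tm : null;
    }

    public static void main(String[] args) {
        List<TaskManager> tms = Arrays.asList(
                new TaskManager("tm-1", new Profile(2.0, 4096)),
                new TaskManager("tm-2", new Profile(4.0, 8192)));
        AllocationStrategy strategy = new FirstFit();

        // The first request fits on tm-1 and uses most of it.
        TaskManager a = allocate(new Profile(1.5, 3072), tms, strategy);
        // tm-1 no longer has a full core free, so this request moves on to tm-2.
        TaskManager b = allocate(new Profile(1.0, 2048), tms, strategy);
        System.out.println(a.id + " " + b.id);
    }
}
```

A count-only scheme would treat both requests as "one slot each"; here the size of each request decides where it can go. Swapping `FirstFit` for another implementation (best fit, spreading, or one that asks for a new TM when nothing fits) changes the policy without touching the rest of the sketch.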
> Enable Slot Resource Profile for Resource Management
> ----------------------------------------------------
>
>                 Key: FLINK-10640
>                 URL: https://issues.apache.org/jira/browse/FLINK-10640
>             Project: Flink
>          Issue Type: New Feature
>          Components: ResourceManager
>            Reporter: Tony Xintong Song
>            Priority: Major
>
> Motivation & Backgrounds
>  * The existing concept of task slots roughly represents how many pipelines
> of tasks a TaskManager can hold. However, it does not consider the
> differences in resource needs and usage of individual tasks. Enabling
> resource profiles of slots may allow Flink to better allocate execution
> resources according to tasks' fine-grained resource needs.
>  * The community version of Flink already contains APIs and some
> implementation for slot resource profiles. However, this logic is not truly
> used. (The ResourceProfile of slot requests is by default set to UNKNOWN with
> negative values, and thus matches any given slot.)
> Preliminary Design
>  * Slot Management
>  A slot represents a certain amount of resources for a single pipeline of
> tasks to run in on a TaskManager. Initially, a TaskManager does not have any
> slots but a total amount of resources. When allocating, the ResourceManager
> finds proper TMs to generate new slots for the tasks to run according to the
> slot requests. Once generated, the slot's size (resource profile) does not
> change until it's freed. The ResourceManager can apply different, portable
> strategies to allocate slots from TaskManagers.
>  * TM Management
>  The size and number of TaskManagers and when to start them can also be
> flexible. TMs can be started and released dynamically, and may have different
> sizes. We may have many different, portable strategies. E.g., an elastic
> session that can run multiple jobs like the session mode while dynamically
> adjusting the size of the session (number of TMs) according to the real-time
> workload.
>  * About Slot Sharing
>  Slot sharing is a good heuristic to easily calculate how many slots are
> needed to get the job running and to get better utilization when there is no
> resource profile in slots. However, with resource profiles enabling
> finer-grained resource management, each individual task has its specific
> resource need and it does not make much sense to have multiple tasks sharing
> the resources of the same slot. Instead, we may introduce locality
> preferences/constraints to support the semantics of putting tasks in
> same/different TMs in a more general way.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)