Thanks for the responses, Till and Xintong. I second Xintong's comment that the SSG-based runtime interface will give us the flexibility to achieve an op/task-based approach. That's one of the most important reasons for our design choice.
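
To make the flexibility point concrete, here is a minimal sketch in plain Python. It is a toy model, not Flink's API: `Resource`, `per_operator_as_groups` and `resolve_group_resources` are hypothetical names made up for illustration. It assumes that an SSG without an explicit requirement falls back to the default slot resource, and shows that putting each operator into its own SSG reproduces operator-level requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical resource profile (not a Flink class)."""
    cpu: float
    memory_mb: int


def per_operator_as_groups(operators, op_specs):
    """Emulate operator-level specs by giving each operator its own SSG.

    Operators without a spec still get their own group, but no group spec.
    """
    op_to_group = {op: f"ssg_{op}" for op in operators}
    group_specs = {f"ssg_{op}": res for op, res in op_specs.items()}
    return op_to_group, group_specs


def resolve_group_resources(op_to_group, group_specs, default_slot):
    """Return the resource of every SSG; unspecified groups get the default."""
    groups = set(op_to_group.values())
    return {g: group_specs.get(g, default_slot) for g in groups}


default_slot = Resource(cpu=1.0, memory_mb=512)
op_to_group, group_specs = per_operator_as_groups(
    operators=["op_1", "op_2"],
    op_specs={"op_1": Resource(cpu=2.0, memory_mb=1024)},  # op_2 is unknown
)
resolved = resolve_group_resources(op_to_group, group_specs, default_slot)
```

In this model `resolved["ssg_op_1"]` is exactly the profile given for op_1, while `resolved["ssg_op_2"]` falls back to the default slot resource.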

Some cents regarding the default operator resource:

- It might be good for the scenario of DataStream jobs.
  ** For light-weight operators, the accumulated configuration error will not be significant. Then, the resource a task uses is proportional to the number of operators it contains.
  ** For heavy operators like join and window, or operators using external resources, users will turn to the fine-grained resource configuration.
- It can increase the stability of a standalone cluster where the registered task executors are heterogeneous (with different default slot resources).
- It might not be good for SQL users. The operators that SQL will be translated into are a black box to the user. We also do not guarantee cross-version consistency of the transformation so far.

I think it can be treated as follow-up work when the fine-grained resource management is end-to-end ready.

Best,
Yangze Guo

On Wed, Jan 20, 2021 at 11:16 AM Xintong Song <tonysong...@gmail.com> wrote:
>
> Thanks for the feedback, Till.
>
> ## I feel that what you proposed (operator-based + default value) might be subsumed by the SSG-based approach.
> Thinking of op_1 -> op_2, there are the following 4 cases, categorized by whether the resource requirements are known to the users.
>
> 1. *Both known.* As previously mentioned, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management. And if op_1 and op_2 are in different groups, there should be no problem switching data exchange mode from pipelined to blocking. This is equivalent to specifying operator resource requirements in your proposal.
> 2. *op_1 known, op_2 unknown.* Similar to 1), except that op_2 is in an SSG whose resource is not specified and thus would have the default slot resource. This is equivalent to having default operator resources in your proposal.
> 3. *Both unknown*.
>    The user can either set op_1 and op_2 to the same SSG or separate SSGs.
>    - If op_1 and op_2 are in the same SSG, it will be equivalent to the coarse-grained resource management, where op_1 and op_2 share a default size slot no matter which data exchange mode is used.
>    - If op_1 and op_2 are in different SSGs, then each of them will use a default size slot. This is equivalent to setting them with default operator resources in your proposal.
> 4. *Total (pipeline) or max (blocking) of op_1 and op_2 is known.*
>    - It is possible that the user learns the total / max resource requirement from executing and monitoring the job, while not being aware of individual operator requirements.
>    - I believe this is the case your proposal does not cover. And TBH, this is probably how most users learn the resource requirements, in my experience.
>    - In this case, the user might need to specify different resources if they want to switch the execution mode, which should not be worse than not being able to use fine-grained resource management.
>
> ## An additional idea inspired by your proposal.
> We may provide multiple options for deciding resources for SSGs whose requirement is not specified, if needed.
>
> - Default slot resource (current design)
> - Default operator resource times number of operators (equivalent to your proposal)
>
> ## Exposing internal runtime strategies
> Theoretically, yes. Tying to the SSGs, the resource requirements might be affected if how SSGs are internally handled changes in future. Practically, I do not concretely see at the moment what kind of changes we may want in future that might conflict with this FLIP proposal, as the question of switching data exchange mode is answered above. I'd suggest not giving up the user friendliness we may gain now for future problems that may or may not exist.
>
> Moreover, the SSG-based approach has the flexibility to achieve the equivalent behavior as the operator-based approach, if we set each operator (or task) to a separate SSG. We can even provide a shortcut option to automatically do that for users, if needed.
>
> Thank you~
>
> Xintong Song
>
> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
> > Thanks for the responses Xintong and Stephan,
> >
> > I agree that being able to define the resource requirements for a group of operators is more user friendly. However, my concern is that we are exposing thereby internal runtime strategies which might limit our flexibility to execute a given job. Moreover, the semantics of configuring resource requirements for SSGs could break if switching from streaming to batch execution. If one defines the resource requirements for op_1 -> op_2 which run in pipelined mode when using the streaming execution, then how do we interpret these requirements when op_1 -> op_2 are executed with a blocking data exchange in batch execution mode? Consequently, I am still leaning towards Stephan's proposal to set the resource requirements per operator.
> >
> > Maybe the following proposal makes the configuration easier: If the user wants to use fine-grained resource requirements, then she needs to specify the default size which is used for operators which have no explicit resource annotation. If this holds true, then every operator would have a resource requirement and the system can try to execute the operators in the best possible manner w/o being constrained by how the user set the SSG requirements.
> >
> > Cheers,
> > Till
> >
> > On Tue, Jan 19, 2021 at 9:09 AM Xintong Song <tonysong...@gmail.com> wrote:
> >
> > > Thanks for the feedback, Stephan.
> > >
> > > Actually, your proposal has also come to my mind at some point. And I have some concerns about it.
> > >
> > > 1. It does not give users the same control as the SSG-based approach.
> > >
> > > While neither approach requires specifying resources for each operator, the SSG-based approach supports the semantic that "some operators together use this much resource" while the operator-based approach doesn't.
> > >
> > > Think of a long pipeline with m operators (o_1, o_2, ..., o_m), and at some point there's an agg o_n (1 < n < m) which significantly reduces the data amount. One can separate the pipeline into 2 groups SSG_1 (o_1, ..., o_n) and SSG_2 (o_n+1, ... o_m), so that configuring much higher parallelisms for operators in SSG_1 than for operators in SSG_2 won't lead to too much wasting of resources. If the two SSGs end up needing different resources, with the SSG-based approach one can directly specify resources for the two groups. However, with the operator-based approach, the user will have to specify resources for each operator in one of the two groups, and tune the default slot resource via configurations to fit the other group.
> > >
> > > 2. It increases the chance of breaking operator chains.
> > >
> > > Setting chainable operators into different slot sharing groups will prevent them from being chained. In the current implementation, downstream operators, if their SSG is not explicitly specified, will be set to the same group as the chainable upstream operators (unless there are multiple upstream operators in different groups), to reduce the chance of breaking chains.
> > >
> > > Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4, deciding SSGs based on whether resource is specified we will easily get groups like (o_1, o_3) & (o_2, o_4), where none of the operators can be chained.
> > > This is also possible for the SSG-based approach, but I believe the chance is much smaller because there's no strong reason for users to specify the groups with alternate operators like that. We are more likely to get groups like (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 and o_3.
> > >
> > > 3. It complicates the system by having two different mechanisms for sharing managed memory in a slot.
> > >
> > > - In FLIP-141, we introduced the intra-slot managed memory sharing mechanism, where managed memory is first distributed according to the consumer type, then further distributed across operators of that consumer type.
> > > - With the operator-based approach, managed memory size specified for an operator should account for all the consumer types of that operator. That means the managed memory is first distributed across operators, then distributed to different consumer types of each operator.
> > >
> > > Unfortunately, the different order of the two calculation steps can lead to different results. To be specific, the semantic of the configuration option `consumer-weights` changed (within a slot vs. within an operator).
> > >
> > > To sum up things:
> > >
> > > While (3) might be a bit more implementation related, I think (1) and (2) somehow suggest that the price for the proposed approach to avoid specifying resource for every operator is that it's not as independent from operator chaining and slot sharing as the operator-based approach discussed in the FLIP.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > > On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <se...@apache.org> wrote:
> > >
> > > > Thanks a lot, Yangze and Xintong for this FLIP.
> > > >
> > > > I want to say, first of all, that this is super well written.
> > > > And the points that the FLIP makes about how to expose the configuration to users are exactly the right thing to figure out first.
> > > > So good job here!
> > > >
> > > > About how to let users specify the resource profiles. If I can sum the FLIP and previous discussion up in my own words, the problem is the following:
> > > >
> > > > > Operator-level specification is the simplest and cleanest approach, because it avoids mixing operator configuration (resource) and scheduling. No matter what other parameters change (chaining, slot sharing, switching pipelined and blocking shuffles), the resource profiles stay the same.
> > > > > But it would require that a user specifies resources on all operators, which makes it hard to use. That's why the FLIP suggests going with specifying resources on a Sharing-Group.
> > > >
> > > > I think both thoughts are important, so can we find a solution where the Resource Profiles are specified on an Operator, but we still avoid that we need to specify a resource profile on every operator?
> > > >
> > > > What do you think about something like the following:
> > > >   - Resource Profiles are specified on an operator level.
> > > >   - Not all operators need profiles
> > > >   - All Operators without a Resource Profile end up in the default slot sharing group with a default profile (will get a default slot).
> > > >   - All Operators with a Resource Profile will go into another slot sharing group (the resource-specified-group).
> > > >   - Users can define different slot sharing groups for operators like they do now, with the exception that you cannot mix operators that have a resource profile and operators that have no resource profile.
> > > >   - The default case where no operator has a resource profile is just a special case of this model
> > > >   - The chaining logic sums up the profiles per operator, like it does now, and the scheduler sums up the profiles of the tasks that it schedules together.
> > > >
> > > > There is another question about reactive scaling raised in the FLIP. I need to think a bit about that. That is indeed a bit more tricky once we have slots of different sizes.
> > > > It is not clear then which of the different slot requests the ResourceManager should fulfill when new resources (TMs) show up, or how the JobManager redistributes the slot resources when resources (TMs) disappear.
> > > > This question is pretty orthogonal, though, to the "how to specify the resources".
> > > >
> > > > Best,
> > > > Stephan
> > > >
> > > > On Fri, Jan 8, 2021 at 5:14 AM Xintong Song <tonysong...@gmail.com> wrote:
> > > >
> > > > > Thanks for drafting the FLIP and driving the discussion, Yangze.
> > > > > And thanks for the feedback, Till and Chesnay.
> > > > >
> > > > > @Till,
> > > > >
> > > > > I agree that specifying requirements for SSGs means that SSGs need to be supported in fine-grained resource management, otherwise each operator might use as many resources as the whole group. However, I cannot think of a strong reason for not supporting SSGs in fine-grained resource management.
> > > > >
> > > > > > Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually.
> > > > > >
> > > > > > So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
> > > > >
> > > > > Couldn't agree more that if all operators' requirements are properly specified, slot sharing should no longer be needed. I think this exactly disproves the example. If we already know op_1 and op_2 each need 100 MB of memory, why would we put them in the same group? If they are in separate groups, with the proposed approach the system can freely deploy them to either a 200 MB TM or two 100 MB TMs.
> > > > >
> > > > > Moreover, the precondition for not needing slot sharing is having resource requirements properly specified for all operators. This is not always possible, and usually requires tremendous effort. One of the benefits of SSG-based requirements is that they allow the user to freely decide the granularity, and thus the effort they want to invest. I would consider an SSG in fine-grained resource management as a group of operators that the user would like to specify the total resource for. There can be only one group in the job, 2~3 groups dividing the job into a few major parts, or as many groups as the number of tasks/operators, depending on how fine-grained the user is able to specify the resources.
> > > > >
> > > > > Having to support SSGs might be a constraint.
> > > > > But given that all the current scheduler implementations already support SSGs, I tend to think of that as an acceptable price for the above discussed usability and flexibility.
> > > > >
> > > > > @Chesnay
> > > > >
> > > > > > Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group are different?
> > > > >
> > > > > Yes. It's a trade-off between usability and resource utilization. To avoid such wasting, the user can define more groups, so that each group contains fewer operators and the chance of having operators with different parallelism will be reduced. The price is to have more resource requirements to specify.
> > > > >
> > > > > > It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
> > > > > > I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resources requirements in such a setting would be a nightmare, and in the end would require operator-level requirements any way.
> > > > > > In that sense, I'm not even sure whether it really increases usability.
> > > > >
> > > > > - As mentioned in my reply to Till's comment, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management.
> > > > > - Even if an operator implementation is reused for multiple applications, it does not guarantee the same resource requirements.
> > > > >   During our years of practice at Alibaba, with per-operator requirements specified for Blink's fine-grained resource management, very few users (including our specialists who are dedicated to supporting Blink users) are experienced enough to accurately predict/estimate the operator resource requirements. Most people rely on the execution-time metrics (throughput, delay, cpu load, memory usage, GC pressure, etc.) to improve the specification.
> > > > >
> > > > > To sum up:
> > > > > If the user is capable of providing proper resource requirements for every operator, that's definitely a good thing and we would not need to rely on the SSGs. However, that shouldn't be a *must* for the fine-grained resource management to work. For those users who are capable and do not like having to set each operator to a separate SSG, I would be ok to have both SSG-based and operator-based runtime interfaces and to only fall back to the SSG requirements when the operator requirements are not specified. However, as the first step, I think we should prioritise the use cases where users are not that experienced.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > > On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler <ches...@apache.org> wrote:
> > > > >
> > > > > > Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group are different?
> > > > > >
> > > > > > It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
> > > > > > I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resources requirements in such a setting would be a nightmare, and in the end would require operator-level requirements any way.
> > > > > > In that sense, I'm not even sure whether it really increases usability.
> > > > > >
> > > > > > My main worry is that if we wire the runtime to work on SSGs it's gonna be difficult to implement more fine-grained approaches, which would not be the case if, for the runtime, they are always defined on an operator-level.
> > > > > >
> > > > > > On 1/7/2021 2:42 PM, Till Rohrmann wrote:
> > > > > > > Thanks for drafting this FLIP and starting this discussion Yangze.
> > > > > > >
> > > > > > > I like that defining resource requirements on a slot sharing group makes the overall setup easier and improves usability of resource requirements.
> > > > > > >
> > > > > > > What I do not like about it is that it changes slot sharing groups from being a scheduling hint to something which needs to be supported in order to support fine grained resource requirements. So far, the idea of slot sharing groups was that it tells the system that a set of operators can be deployed in the same slot. But the system still had the freedom to say that it would rather place these tasks in different slots if it wanted.
> > > > > > > If we now specify resource requirements on a per slot sharing group basis, then the only option for a scheduler which does not support slot sharing groups is to say that every operator in this slot sharing group needs a slot with the same resources as the whole group.
> > > > > > >
> > > > > > > So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
> > > > > > >
> > > > > > > Originally, one of the primary goals of slot sharing groups was to make it easier for the user to reason about how many slots a job needs independent of the actual number of operators in the job. Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually. What matters is whether the whole cluster has enough resources to run all tasks or not.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo <karma...@gmail.com> wrote:
> > > > > > >
> > > > > > >> Hi, there,
> > > > > > >>
> > > > > > >> We would like to start a discussion thread on "FLIP-156: Runtime Interfaces for Fine-Grained Resource Requirements"[1], where we propose Slot Sharing Group (SSG) based runtime interfaces for specifying fine-grained resource requirements.
> > > > > > >>
> > > > > > >> In this FLIP:
> > > > > > >> - Expound the user story of fine-grained resource management.
> > > > > > >> - Propose runtime interfaces for specifying SSG-based resource requirements.
> > > > > > >> - Discuss the pros and cons of the three potential granularities for specifying the resource requirements (op, task and slot sharing group) and explain why we choose the slot sharing group.
> > > > > > >>
> > > > > > >> Please find more details in the FLIP wiki document [1]. Looking forward to your feedback.
> > > > > > >>
> > > > > > >> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Yangze Guo
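
For reference, the op_1 / op_2 example discussed in this thread (each operator needs 100 MB, the cluster has two TMs with one 100 MB slot each) can be checked with a small sketch in plain Python. This is only a toy model under stated assumptions: `can_place` is a hypothetical helper, slots have a fixed size, each request occupies one whole slot, and only memory is considered. It is not how the Flink scheduler is implemented.

```python
def can_place(requests_mb, slots_mb):
    """Check whether every request can get its own fixed-size slot.

    Pairing the largest request with the largest slot, the second largest
    with the second largest, and so on, is sufficient for this one-to-one
    matching problem.
    """
    reqs = sorted(requests_mb, reverse=True)
    slots = sorted(slots_mb, reverse=True)
    if len(reqs) > len(slots):
        return False
    return all(r <= s for r, s in zip(reqs, slots))


cluster = [100, 100]  # two TMs, one 100 MB slot each

# op_1 and op_2 in the same SSG: one request for the 200 MB total.
same_group = can_place([200], cluster)

# op_1 and op_2 in separate SSGs (or operator-level specs): two 100 MB requests.
separate_groups = can_place([100, 100], cluster)
```

In this model `same_group` is false and `separate_groups` is true, which matches both observations from the thread: a single 200 MB group does not fit the two 100 MB slots, while placing the two operators into separate groups lets the job run on the same cluster.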