Thanks for the summary, Yangze. The changes and follow-up issues LGTM. Let's wait for responses from the others before starting a vote.
Thank you~

Xintong Song

On Tue, Jan 26, 2021 at 11:08 AM Yangze Guo <karma...@gmail.com> wrote:

> Thanks everyone for the lively discussion. I'd like to try to
> summarize the current convergence in the discussion. Please let me
> know if I got things wrong or missed something crucial here.
>
> Change of this FLIP:
> - Treat the SSG resource requirements as a hint instead of a
>   restriction for the runtime. That should be explicitly explained in
>   the JavaDocs.
>
> Potential follow-up issues if needed:
> - Provide operator-level resource configuration interface.
> - Provide multiple options for deciding resources for SSGs whose
>   requirement is not specified:
>   ** Default slot resource.
>   ** Default operator resource times number of operators.
>
> If there are no other issues, I'll update the FLIP accordingly and
> start a vote thread. Thanks all for the valuable feedback again.
>
> Best,
> Yangze Guo
>
> On Fri, Jan 22, 2021 at 11:30 AM Xintong Song <tonysong...@gmail.com> wrote:
> >
> > FGRuntimeInterface.png
> >
> > Thank you~
> >
> > Xintong Song
> >
> > On Fri, Jan 22, 2021 at 11:11 AM Xintong Song <tonysong...@gmail.com> wrote:
> >>
> >> I think Chesnay's proposal could actually work. IIUC, the key point is
> >> to derive operator requirements from SSG requirements on the API side,
> >> so that the runtime only deals with operator requirements. It's
> >> debatable how the deriving should be done though. E.g., an alternative
> >> could be to evenly divide the SSG requirement into requirements of the
> >> operators in the group.
> >>
> >> However, I'm not entirely sure which option is more desirable. I've
> >> illustrated my understanding in the following figure: Chesnay's
> >> proposal is on the top, and the SSG-based proposal in this FLIP is on
> >> the bottom.
> >>
> >> I think the major difference between the two approaches is where
> >> deriving operator requirements from SSG requirements happens.
> >>
> >> - Chesnay's proposal simplifies the runtime logic and the interface to
> >>   expose, at the price of moving more complexity (i.e. the deriving) to
> >>   the API side. The question is, where do we prefer to keep the
> >>   complexity? I'm slightly leaning towards having a thin API and
> >>   keeping the complexity in the runtime if possible.
> >>
> >> - Notice that the dashed arrows represent optional steps that are
> >>   needed only for schedulers that do not respect SSGs, which we don't
> >>   have at the moment. If we only look at the solid arrows, then the
> >>   SSG-based approach is much simpler, without needing to derive and
> >>   aggregate the requirements back and forth. I'm not sure about
> >>   complicating the current design only for potential future needs.
> >>
> >> Thank you~
> >>
> >> Xintong Song
> >>
> >> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler <ches...@apache.org> wrote:
> >>>
> >>> You're raising a good point, but I think I can rectify that with a
> >>> minor adjustment.
> >>>
> >>> Default requirements are whatever the default requirements are;
> >>> setting the requirements for one operator has no effect on other
> >>> operators.
> >>>
> >>> With these rules, and some API enhancements, the following mockup
> >>> would replicate the SSG-based behavior:
> >>>
> >>>     Map<SlotSharingGroupId, Requirements> requirements = ...
> >>>     for slotSharingGroup in env.getSlotSharingGroups() {
> >>>         vertices = slotSharingGroup.getVertices()
> >>>         vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()))
> >>>         vertices.remaining().setRequirements(ZERO)
> >>>     }
> >>>
> >>> We could even allow setting requirements on slotsharing-groups /
> >>> colocation-groups and internally translate them accordingly.
> >>> I can't help but feel this is a plain API issue.
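The mockup above, together with the "evenly divide" alternative mentioned in the reply to it, could be fleshed out roughly as follows. This is only a sketch under stated assumptions: `Requirements`, `firstTakesAll`, `evenSplit` and `totalMemoryMb` are hypothetical names made up for illustration, vertices are represented by plain strings, and none of this is Flink's actual API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SsgRequirementDerivation {

    /** Hypothetical stand-in for a resource requirement; not Flink's ResourceSpec. */
    static final class Requirements {
        static final Requirements ZERO = new Requirements(0.0, 0);
        final double cpuCores;
        final int memoryMb;

        Requirements(double cpuCores, int memoryMb) {
            this.cpuCores = cpuCores;
            this.memoryMb = memoryMb;
        }
    }

    /** Mockup strategy: the first vertex carries the whole SSG requirement, the rest get ZERO. */
    static Map<String, Requirements> firstTakesAll(List<String> vertices, Requirements group) {
        Map<String, Requirements> perVertex = new LinkedHashMap<>();
        for (int i = 0; i < vertices.size(); i++) {
            perVertex.put(vertices.get(i), i == 0 ? group : Requirements.ZERO);
        }
        return perVertex;
    }

    /** Alternative strategy: divide the SSG requirement evenly among the vertices. */
    static Map<String, Requirements> evenSplit(List<String> vertices, Requirements group) {
        Map<String, Requirements> perVertex = new LinkedHashMap<>();
        int n = vertices.size();
        for (int i = 0; i < n; i++) {
            // Hand the integer-division remainder to the first vertices so the total stays exact.
            int memoryMb = group.memoryMb / n + (i < group.memoryMb % n ? 1 : 0);
            perVertex.put(vertices.get(i), new Requirements(group.cpuCores / n, memoryMb));
        }
        return perVertex;
    }

    /** Sums the memory over all vertices; useful to check a strategy preserves the SSG total. */
    static int totalMemoryMb(Map<String, Requirements> perVertex) {
        int total = 0;
        for (Requirements r : perVertex.values()) {
            total += r.memoryMb;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> vertices = Arrays.asList("op_1", "op_2", "op_3");
        Requirements group = new Requirements(1.5, 200);

        System.out.println("first-takes-all total MB: " + totalMemoryMb(firstTakesAll(vertices, group)));
        System.out.println("even-split total MB:      " + totalMemoryMb(evenSplit(vertices, group)));
    }
}
```

Both strategies keep the per-group total unchanged, so a scheduler that sums the operator requirements of a slot sharing group would arrive back at the SSG requirement; they differ only in what a scheduler that ignores SSGs would see for each individual operator.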
> >>>
> >>> On 1/21/2021 9:44 AM, Till Rohrmann wrote:
> >>> > If I understand you correctly Chesnay, then you want to decouple the
> >>> > resource requirement specification from the slot sharing group
> >>> > assignment. Hence, per default all operators would be in the same
> >>> > slot sharing group. If there is no operator with a resource
> >>> > specification, then the system would allocate a default slot for it.
> >>> > If there is at least one such operator, then the system would sum up
> >>> > all the specified resources and allocate a slot of this size. This
> >>> > effectively means that all unspecified operators will implicitly
> >>> > have a zero resource requirement. Did I understand your idea
> >>> > correctly?
> >>> >
> >>> > I am wondering whether this wouldn't lead to a surprising behaviour
> >>> > for the user. If the user specifies the resource requirements for a
> >>> > single operator, then he probably will assume that the other
> >>> > operators will get the default share of resources and not nothing.
> >>> >
> >>> > Cheers,
> >>> > Till
> >>> >
> >>> > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler <ches...@apache.org> wrote:
> >>> >
> >>> >     Is there even a functional difference between specifying the
> >>> >     requirements for an SSG vs specifying the same requirements on a
> >>> >     single operator within that group (ideally a colocation group to
> >>> >     avoid this whole hint business)?
> >>> >
> >>> >     Wouldn't we get the best of both worlds in the latter case?
> >>> >
> >>> >     Users can take shortcuts to define shared requirements,
> >>> >     but refine them further as needed on a per-operator basis,
> >>> >     without changing semantics of slotsharing groups
> >>> >     nor the runtime being locked into SSG-based requirements.
> >>> >
> >>> >     (And before anyone argues what happens if slotsharing groups
> >>> >     change or whatnot, that's a plain API issue that we could surely
> >>> >     solve. (A plain iteration over slotsharing groups and therein
> >>> >     contained operators would suffice)).
> >>> >
> >>> >     On 1/20/2021 6:48 PM, Till Rohrmann wrote:
> >>> >     > Maybe a different minor idea: Would it be possible to treat the
> >>> >     > SSG resource requirements as a hint for the runtime, similar to
> >>> >     > how slot sharing groups are designed at the moment? Meaning
> >>> >     > that we don't give the guarantee that Flink will always deploy
> >>> >     > this set of tasks together no matter what comes. If, for
> >>> >     > example, the runtime can derive by some means the resource
> >>> >     > requirements for each task based on the requirements for the
> >>> >     > SSG, this could be possible. One easy strategy would be to give
> >>> >     > every task the same resources as the whole slot sharing group.
> >>> >     > Another one could be distributing the resources equally among
> >>> >     > the tasks. This does not even have to be implemented, but we
> >>> >     > would give ourselves the freedom to change scheduling if the
> >>> >     > need should arise.
> >>> >     >
> >>> >     > Cheers,
> >>> >     > Till
> >>> >     >
> >>> >     > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo <karma...@gmail.com> wrote:
> >>> >     >
> >>> >     >> Thanks for the responses, Till and Xintong.
> >>> >     >>
> >>> >     >> I second Xintong's comment that SSG-based runtime interface
> >>> >     >> will give us the flexibility to achieve op/task-based
> >>> >     >> approach. That's one of the most important reasons for our
> >>> >     >> design choice.
> >>> >     >>
> >>> >     >> Some cents regarding the default operator resource:
> >>> >     >> - It might be good for the scenario of DataStream jobs.
> >>> >     >> ** For light-weight operators, the accumulative configuration
> >>> >     >> error will not be significant. Then, the resource a task uses
> >>> >     >> is proportional to the number of operators it contains.
> >>> > >> ** For heavy operators like join and window or operators > >>> > using the > >>> > >> external resources, user will turn to the fine-grained > resource > >>> > >> configuration. > >>> > >> - It can increase the stability for the standalone cluster > >>> > where task > >>> > >> executors registered are heterogeneous(with different default > slot > >>> > >> resources). > >>> > >> - It might not be good for SQL users. The operators that SQL > >>> > will be > >>> > >> transferred to is a black box to the user. We also do not > guarantee > >>> > >> the cross-version of consistency of the transformation so far. > >>> > >> > >>> > >> I think it can be treated as a follow-up work when the > fine-grained > >>> > >> resource management is end-to-end ready. > >>> > >> > >>> > >> Best, > >>> > >> Yangze Guo > >>> > >> > >>> > >> > >>> > >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song > >>> > <tonysong...@gmail.com <mailto:tonysong...@gmail.com>> > >>> > >> wrote: > >>> > >>> Thanks for the feedback, Till. > >>> > >>> > >>> > >>> ## I feel that what you proposed (operator-based + default > >>> > value) might > >>> > >> be > >>> > >>> subsumed by the SSG-based approach. > >>> > >>> Thinking of op_1 -> op_2, there are the following 4 cases, > >>> > categorized by > >>> > >>> whether the resource requirements are known to the users. > >>> > >>> > >>> > >>> 1. *Both known.* As previously mentioned, there's no > >>> > reason to put > >>> > >>> multiple operators whose individual resource requirements > >>> > are already > >>> > >> known > >>> > >>> into the same group in fine-grained resource management. > >>> > And if op_1 > >>> > >> and > >>> > >>> op_2 are in different groups, there should be no problem > >>> > switching > >>> > >> data > >>> > >>> exchange mode from pipelined to blocking. This is > >>> > equivalent to > >>> > >> specifying > >>> > >>> operator resource requirements in your proposal. > >>> > >>> 2. 
*op_1 known, op_2 unknown.* Similar to 1), except that > >>> > op_2 is in a > >>> > >>> SSG whose resource is not specified thus would have the > >>> > default slot > >>> > >>> resource. This is equivalent to having default operator > >>> > resources in > >>> > >> your > >>> > >>> proposal. > >>> > >>> 3. *Both unknown*. The user can either set op_1 and op_2 > >>> > to the same > >>> > >> SSG > >>> > >>> or separate SSGs. > >>> > >>> - If op_1 and op_2 are in the same SSG, it will be > >>> > equivalent to > >>> > >> the > >>> > >>> coarse-grained resource management, where op_1 and > op_2 > >>> > share a > >>> > >> default > >>> > >>> size slot no matter which data exchange mode is used. > >>> > >>> - If op_1 and op_2 are in different SSGs, then each of > >>> > them will > >>> > >> use > >>> > >>> a default size slot. This is equivalent to setting > them > >>> > with > >>> > >> default > >>> > >>> operator resources in your proposal. > >>> > >>> 4. *Total (pipeline) or max (blocking) of op_1 and op_2 > is > >>> > known.* > >>> > >>> - It is possible that the user learns the total / max > >>> > resource > >>> > >>> requirement from executing and monitoring the job, > >>> > while not > >>> > >>> being aware of > >>> > >>> individual operator requirements. > >>> > >>> - I believe this is the case your proposal does not > >>> > cover. And TBH, > >>> > >>> this is probably how most users learn the resource > >>> > requirements, > >>> > >>> according > >>> > >>> to my experiences. > >>> > >>> - In this case, the user might need to specify > >>> > different resources > >>> > >> if > >>> > >>> he wants to switch the execution mode, which should > not > >>> > be worse > >>> > >> than not > >>> > >>> being able to use fine-grained resource management. > >>> > >>> > >>> > >>> > >>> > >>> ## An additional idea inspired by your proposal. > >>> > >>> We may provide multiple options for deciding resources for > >>> > SSGs whose > >>> > >>> requirement is not specified, if needed. 
> >>> > >>> > >>> > >>> - Default slot resource (current design) > >>> > >>> - Default operator resource times number of operators > >>> > (equivalent to > >>> > >>> your proposal) > >>> > >>> > >>> > >>> > >>> > >>> ## Exposing internal runtime strategies > >>> > >>> Theoretically, yes. Tying to the SSGs, the resource > >>> > requirements might be > >>> > >>> affected if how SSGs are internally handled changes in > future. > >>> > >> Practically, > >>> > >>> I do not concretely see at the moment what kind of changes we > >>> > may want in > >>> > >>> future that might conflict with this FLIP proposal, as the > >>> > question of > >>> > >>> switching data exchange mode answered above. I'd suggest to > >>> > not give up > >>> > >> the > >>> > >>> user friendliness we may gain now for the future problems > that > >>> > may or may > >>> > >>> not exist. > >>> > >>> > >>> > >>> Moreover, the SSG-based approach has the flexibility to > >>> > achieve the > >>> > >>> equivalent behavior as the operator-based approach, if we > set each > >>> > >> operator > >>> > >>> (or task) to a separate SSG. We can even provide a shortcut > >>> > option to > >>> > >>> automatically do that for users, if needed. > >>> > >>> > >>> > >>> > >>> > >>> Thank you~ > >>> > >>> > >>> > >>> Xintong Song > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann > >>> > <trohrm...@apache.org <mailto:trohrm...@apache.org>> > >>> > >> wrote: > >>> > >>>> Thanks for the responses Xintong and Stephan, > >>> > >>>> > >>> > >>>> I agree that being able to define the resource requirements > for a > >>> > >> group of > >>> > >>>> operators is more user friendly. However, my concern is that > >>> > we are > >>> > >>>> exposing thereby internal runtime strategies which might > >>> > limit our > >>> > >>>> flexibility to execute a given job. 
Moreover, the semantics > of > >>> > >> configuring > >>> > >>>> resource requirements for SSGs could break if switching from > >>> > streaming > >>> > >> to > >>> > >>>> batch execution. If one defines the resource requirements > for > >>> > op_1 -> > >>> > >> op_2 > >>> > >>>> which run in pipelined mode when using the streaming > >>> > execution, then > >>> > >> how do > >>> > >>>> we interpret these requirements when op_1 -> op_2 are > >>> > executed with a > >>> > >>>> blocking data exchange in batch execution mode? > Consequently, > >>> > I am > >>> > >> still > >>> > >>>> leaning towards Stephan's proposal to set the resource > >>> > requirements per > >>> > >>>> operator. > >>> > >>>> > >>> > >>>> Maybe the following proposal makes the configuration easier: > >>> > If the > >>> > >> user > >>> > >>>> wants to use fine-grained resource requirements, then she > >>> > needs to > >>> > >> specify > >>> > >>>> the default size which is used for operators which have no > >>> > explicit > >>> > >>>> resource annotation. If this holds true, then every operator > >>> > would > >>> > >> have a > >>> > >>>> resource requirement and the system can try to execute the > >>> > operators > >>> > >> in the > >>> > >>>> best possible manner w/o being constrained by how the user > >>> > set the SSG > >>> > >>>> requirements. > >>> > >>>> > >>> > >>>> Cheers, > >>> > >>>> Till > >>> > >>>> > >>> > >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song > >>> > <tonysong...@gmail.com <mailto:tonysong...@gmail.com>> > >>> > >>>> wrote: > >>> > >>>> > >>> > >>>>> Thanks for the feedback, Stephan. > >>> > >>>>> > >>> > >>>>> Actually, your proposal has also come to my mind at some > >>> > point. And I > >>> > >>>> have > >>> > >>>>> some concerns about it. > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> 1. It does not give users the same control as the SSG-based > >>> > approach. 
> >>> > >>>>>
> >>> > >>>>> While neither approach requires specifying resources for each
> >>> > >>>>> operator, the SSG-based approach supports the semantic that
> >>> > >>>>> "some operators together use this much resource" while the
> >>> > >>>>> operator-based approach doesn't.
> >>> > >>>>>
> >>> > >>>>> Think of a long pipeline with m operators (o_1, o_2, ..., o_m),
> >>> > >>>>> and at some point there's an agg o_n (1 < n < m) which
> >>> > >>>>> significantly reduces the data amount. One can separate the
> >>> > >>>>> pipeline into 2 groups SSG_1 (o_1, ..., o_n) and SSG_2
> >>> > >>>>> (o_n+1, ..., o_m), so that configuring much higher parallelisms
> >>> > >>>>> for operators in SSG_1 than for operators in SSG_2 won't waste
> >>> > >>>>> too many resources. If the two SSGs end up needing different
> >>> > >>>>> resources, with the SSG-based approach one can directly specify
> >>> > >>>>> resources for the two groups. However, with the operator-based
> >>> > >>>>> approach, the user will have to specify resources for each
> >>> > >>>>> operator in one of the two groups, and tune the default slot
> >>> > >>>>> resource via configurations to fit the other group.
> >>> > >>>>>
> >>> > >>>>> 2. It increases the chance of breaking operator chains.
> >>> > >>>>>
> >>> > >>>>> Setting chainable operators into different slot sharing groups
> >>> > >>>>> will prevent them from being chained. In the current
> >>> > >>>>> implementation, downstream operators, if their SSG is not
> >>> > >>>>> explicitly specified, will be set to the same group as the
> >>> > >>>>> chainable upstream operators (unless there are multiple upstream
> >>> > >>>>> operators in different groups), to reduce the chance of breaking
> >>> > >>>>> chains.
> >>> > >>>>>
> >>> > >>>>> Think of chainable operators o_1 -> o_2 -> o_3 -> o_4: if SSGs
> >>> > >>>>> are decided based on whether resources are specified, we can
> >>> > >>>>> easily get groups like (o_1, o_3) & (o_2, o_4), where none of
> >>> > >>>>> the operators can be chained. This is also possible for the
> >>> > >>>>> SSG-based approach, but I believe the chance is much smaller
> >>> > >>>>> because there's no strong reason for users to specify the groups
> >>> > >>>>> with alternating operators like that. We are more likely to get
> >>> > >>>>> groups like (o_1, o_2) & (o_3, o_4), where the chain breaks only
> >>> > >>>>> between o_2 and o_3.
> >>> > >>>>>
> >>> > >>>>> 3. It complicates the system by having two different mechanisms
> >>> > >>>>> for sharing managed memory in a slot.
> >>> > >>>>>
> >>> > >>>>> - In FLIP-141, we introduced the intra-slot managed memory
> >>> > >>>>> sharing mechanism, where managed memory is first distributed
> >>> > >>>>> according to the consumer type, then further distributed across
> >>> > >>>>> operators of that consumer type.
> >>> > >>>>>
> >>> > >>>>> - With the operator-based approach, the managed memory size
> >>> > >>>>> specified for an operator should account for all the consumer
> >>> > >>>>> types of that operator.
> >>> > >> That > >>> > >>>>> means the managed memory is first distributed across > >>> > operators, then > >>> > >>>>> distributed to different consumer types of each operator. > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> Unfortunately, the different order of the two calculation > >>> > steps can > >>> > >> lead > >>> > >>>> to > >>> > >>>>> different results. To be specific, the semantic of the > >>> > configuration > >>> > >>>> option > >>> > >>>>> `consumer-weights` changed (within a slot vs. within an > >>> > operator). > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> To sum up things: > >>> > >>>>> > >>> > >>>>> While (3) might be a bit more implementation related, I > >>> > think (1) > >>> > >> and (2) > >>> > >>>>> somehow suggest that, the price for the proposed approach > to > >>> > avoid > >>> > >>>>> specifying resource for every operator is that it's not as > >>> > >> independent > >>> > >>>> from > >>> > >>>>> operator chaining and slot sharing as the operator-based > >>> > approach > >>> > >>>> discussed > >>> > >>>>> in the FLIP. > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> Thank you~ > >>> > >>>>> > >>> > >>>>> Xintong Song > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen > >>> > <se...@apache.org <mailto:se...@apache.org>> > >>> > >> wrote: > >>> > >>>>>> Thanks a lot, Yangze and Xintong for this FLIP. > >>> > >>>>>> > >>> > >>>>>> I want to say, first of all, that this is super well > >>> > written. And > >>> > >> the > >>> > >>>>>> points that the FLIP makes about how to expose the > >>> > configuration to > >>> > >>>> users > >>> > >>>>>> is exactly the right thing to figure out first. > >>> > >>>>>> So good job here! > >>> > >>>>>> > >>> > >>>>>> About how to let users specify the resource profiles. 
If I > >>> > can sum > >>> > >> the > >>> > >>>>> FLIP > >>> > >>>>>> and previous discussion up in my own words, the problem > is the > >>> > >>>> following: > >>> > >>>>>> Operator-level specification is the simplest and cleanest > >>> > approach, > >>> > >>>>> because > >>> > >>>>>>> it avoids mixing operator configuration (resource) and > >>> > >> scheduling. No > >>> > >>>>>>> matter what other parameters change (chaining, slot > sharing, > >>> > >>>> switching > >>> > >>>>>>> pipelined and blocking shuffles), the resource profiles > >>> > stay the > >>> > >>>> same. > >>> > >>>>>>> But it would require that a user specifies resources on > all > >>> > >>>> operators, > >>> > >>>>>>> which makes it hard to use. That's why the FLIP suggests > going > >>> > >> with > >>> > >>>>>>> specifying resources on a Sharing-Group. > >>> > >>>>>> > >>> > >>>>>> I think both thoughts are important, so can we find a > solution > >>> > >> where > >>> > >>>> the > >>> > >>>>>> Resource Profiles are specified on an Operator, but we > >>> > still avoid > >>> > >> that > >>> > >>>>> we > >>> > >>>>>> need to specify a resource profile on every operator? > >>> > >>>>>> > >>> > >>>>>> What do you think about something like the following: > >>> > >>>>>> - Resource Profiles are specified on an operator level. > >>> > >>>>>> - Not all operators need profiles > >>> > >>>>>> - All Operators without a Resource Profile ended up in > the > >>> > >> default > >>> > >>>> slot > >>> > >>>>>> sharing group with a default profile (will get a default > slot). > >>> > >>>>>> - All Operators with a Resource Profile will go into > >>> > another slot > >>> > >>>>> sharing > >>> > >>>>>> group (the resource-specified-group). 
> >>> > >>>>>> - Users can define different slot sharing groups for > >>> > operators > >>> > >> like > >>> > >>>>> they > >>> > >>>>>> do now, with the exception that you cannot mix operators > >>> > that have > >>> > >> a > >>> > >>>>>> resource profile and operators that have no resource > profile. > >>> > >>>>>> - The default case where no operator has a resource > >>> > profile is > >>> > >> just a > >>> > >>>>>> special case of this model > >>> > >>>>>> - The chaining logic sums up the profiles per operator, > >>> > like it > >>> > >> does > >>> > >>>>> now, > >>> > >>>>>> and the scheduler sums up the profiles of the tasks that > it > >>> > >> schedules > >>> > >>>>>> together. > >>> > >>>>>> > >>> > >>>>>> > >>> > >>>>>> There is another question about reactive scaling raised > in the > >>> > >> FLIP. I > >>> > >>>>> need > >>> > >>>>>> to think a bit about that. That is indeed a bit more > tricky > >>> > once we > >>> > >>>> have > >>> > >>>>>> slots of different sizes. > >>> > >>>>>> It is not clear then which of the different slot requests > the > >>> > >>>>>> ResourceManager should fulfill when new resources (TMs) > >>> > show up, > >>> > >> or how > >>> > >>>>> the > >>> > >>>>>> JobManager redistributes the slots resources when > resources > >>> > (TMs) > >>> > >>>>> disappear > >>> > >>>>>> This question is pretty orthogonal, though, to the "how to > >>> > specify > >>> > >> the > >>> > >>>>>> resources". > >>> > >>>>>> > >>> > >>>>>> > >>> > >>>>>> Best, > >>> > >>>>>> Stephan > >>> > >>>>>> > >>> > >>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song > >>> > <tonysong...@gmail.com <mailto:tonysong...@gmail.com> > >>> > >>>>> wrote: > >>> > >>>>>>> Thanks for drafting the FLIP and driving the discussion, > >>> > Yangze. > >>> > >>>>>>> And Thanks for the feedback, Till and Chesnay. 
> >>> > >>>>>>> > >>> > >>>>>>> @Till, > >>> > >>>>>>> > >>> > >>>>>>> I agree that specifying requirements for SSGs means that > SSGs > >>> > >> need to > >>> > >>>>> be > >>> > >>>>>>> supported in fine-grained resource management, otherwise > each > >>> > >>>> operator > >>> > >>>>>>> might use as many resources as the whole group. However, > I > >>> > cannot > >>> > >>>> think > >>> > >>>>>> of > >>> > >>>>>>> a strong reason for not supporting SSGs in fine-grained > >>> > resource > >>> > >>>>>>> management. > >>> > >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>>> Interestingly, if all operators have their resources > properly > >>> > >>>>>> specified, > >>> > >>>>>>>> then slot sharing is no longer needed because Flink > could > >>> > >> slice off > >>> > >>>>> the > >>> > >>>>>>>> appropriately sized slots for every Task individually. > >>> > >>>>>>>> > >>> > >>>>>>> So for example, if we have a job consisting of two > >>> > operator op_1 > >>> > >> and > >>> > >>>>> op_2 > >>> > >>>>>>>> where each op needs 100 MB of memory, we would then say > that > >>> > >> the > >>> > >>>> slot > >>> > >>>>>>>> sharing group needs 200 MB of memory to run. If we have > a > >>> > >> cluster > >>> > >>>>> with > >>> > >>>>>> 2 > >>> > >>>>>>>> TMs with one slot of 100 MB each, then the system > cannot run > >>> > >> this > >>> > >>>>> job. > >>> > >>>>>> If > >>> > >>>>>>>> the resources were specified on an operator level, then > the > >>> > >> system > >>> > >>>>>> could > >>> > >>>>>>>> still make the decision to deploy op_1 to TM_1 and op_2 > to > >>> > >> TM_2. > >>> > >>>>>>> > >>> > >>>>>>> Couldn't agree more that if all operators' requirements > are > >>> > >> properly > >>> > >>>>>>> specified, slot sharing should be no longer needed. I > >>> > think this > >>> > >>>>> exactly > >>> > >>>>>>> disproves the example. If we already know op_1 and op_2 > each > >>> > >> needs > >>> > >>>> 100 > >>> > >>>>> MB > >>> > >>>>>>> of memory, why would we put them in the same group? 
If > >>> > they are > >>> > >> in > >>> > >>>>>> separate > >>> > >>>>>>> groups, with the proposed approach the system can freely > >>> > deploy > >>> > >> them > >>> > >>>> to > >>> > >>>>>>> either a 200 MB TM or two 100 MB TMs. > >>> > >>>>>>> > >>> > >>>>>>> Moreover, the precondition for not needing slot sharing > is > >>> > having > >>> > >>>>>> resource > >>> > >>>>>>> requirements properly specified for all operators. This > is not > >>> > >> always > >>> > >>>>>>> possible, and usually requires tremendous efforts. One > of the > >>> > >>>> benefits > >>> > >>>>>> for > >>> > >>>>>>> SSG-based requirements is that it allows the user to > freely > >>> > >> decide > >>> > >>>> the > >>> > >>>>>>> granularity, thus efforts they want to pay. I would > >>> > consider SSG > >>> > >> in > >>> > >>>>>>> fine-grained resource management as a group of operators > >>> > that the > >>> > >>>> user > >>> > >>>>>>> would like to specify the total resource for. There can > be > >>> > only > >>> > >> one > >>> > >>>>> group > >>> > >>>>>>> in the job, 2~3 groups dividing the job into a few major > >>> > parts, > >>> > >> or as > >>> > >>>>>> many > >>> > >>>>>>> groups as the number of tasks/operators, depending on how > >>> > >>>> fine-grained > >>> > >>>>>> the > >>> > >>>>>>> user is able to specify the resources. > >>> > >>>>>>> > >>> > >>>>>>> Having to support SSGs might be a constraint. But given > >>> > that all > >>> > >> the > >>> > >>>>>>> current scheduler implementations already support SSGs, I > >>> > tend to > >>> > >>>> think > >>> > >>>>>>> that as an acceptable price for the above discussed > >>> > usability and > >>> > >>>>>>> flexibility. > >>> > >>>>>>> > >>> > >>>>>>> @Chesnay > >>> > >>>>>>> > >>> > >>>>>>> Will declaring them on slot sharing groups not also waste > >>> > >> resources > >>> > >>>> if > >>> > >>>>>> the > >>> > >>>>>>>> parallelism of operators within that group are > different? > >>> > >>>>>>>> > >>> > >>>>>>> Yes. 
It's a trade-off between usability and resource > >>> > >> utilization. To > >>> > >>>>>> avoid > >>> > >>>>>>> such wasting, the user can define more groups, so that > >>> > each group > >>> > >>>>>> contains > >>> > >>>>>>> less operators and the chance of having operators with > >>> > different > >>> > >>>>>>> parallelism will be reduced. The price is to have more > >>> > resource > >>> > >>>>>>> requirements to specify. > >>> > >>>>>>> > >>> > >>>>>>> It also seems like quite a hassle for users having to > >>> > >> recalculate the > >>> > >>>>>>>> resource requirements if they change the slot sharing. > >>> > >>>>>>>> I'd think that it's not really workable for users that > create > >>> > >> a set > >>> > >>>>> of > >>> > >>>>>>>> re-usable operators which are mixed and matched in their > >>> > >>>>> applications; > >>> > >>>>>>>> managing the resources requirements in such a setting > >>> > would be > >>> > >> a > >>> > >>>>>>>> nightmare, and in the end would require operator-level > >>> > >> requirements > >>> > >>>>> any > >>> > >>>>>>>> way. > >>> > >>>>>>>> In that sense, I'm not even sure whether it really > increases > >>> > >>>>> usability. > >>> > >>>>>>> - As mentioned in my reply to Till's comment, > there's no > >>> > >> reason to > >>> > >>>>> put > >>> > >>>>>>> multiple operators whose individual resource > >>> > requirements are > >>> > >>>>> already > >>> > >>>>>>> known > >>> > >>>>>>> into the same group in fine-grained resource > management. > >>> > >>>>>>> - Even an operator implementation is reused for > multiple > >>> > >>>>> applications, > >>> > >>>>>>> it does not guarantee the same resource requirements. 
> >>> > During > >>> > >> our > >>> > >>>>> years > >>> > >>>>>>> of > >>> > >>>>>>> practices in Alibaba, with per-operator requirements > >>> > >> specified for > >>> > >>>>>>> Blink's > >>> > >>>>>>> fine-grained resource management, very few users > >>> > (including > >>> > >> our > >>> > >>>>>>> specialists > >>> > >>>>>>> who are dedicated to supporting Blink users) are as > >>> > >> experienced as > >>> > >>>>> to > >>> > >>>>>>> accurately predict/estimate the operator resource > >>> > >> requirements. > >>> > >>>> Most > >>> > >>>>>>> people > >>> > >>>>>>> rely on the execution-time metrics (throughput, > delay, cpu > >>> > >> load, > >>> > >>>>>> memory > >>> > >>>>>>> usage, GC pressure, etc.) to improve the > specification. > >>> > >>>>>>> > >>> > >>>>>>> To sum up: > >>> > >>>>>>> If the user is capable of providing proper resource > >>> > requirements > >>> > >> for > >>> > >>>>>> every > >>> > >>>>>>> operator, that's definitely a good thing and we would not > >>> > need to > >>> > >>>> rely > >>> > >>>>> on > >>> > >>>>>>> the SSGs. However, that shouldn't be a *must* for the > >>> > >> fine-grained > >>> > >>>>>> resource > >>> > >>>>>>> management to work. For those users who are capable and > do not > >>> > >> like > >>> > >>>>>> having > >>> > >>>>>>> to set each operator to a separate SSG, I would be ok to > have > >>> > >> both > >>> > >>>>>>> SSG-based and operator-based runtime interfaces and to > only > >>> > >> fallback > >>> > >>>> to > >>> > >>>>>> the > >>> > >>>>>>> SSG requirements when the operator requirements are not > >>> > >> specified. > >>> > >>>>>> However, > >>> > >>>>>>> as the first step, I think we should prioritise the use > cases > >>> > >> where > >>> > >>>>> users > >>> > >>>>>>> are not that experienced. 
> >>> > >>>>>>> > >>> > >>>>>>> Thank you~ > >>> > >>>>>>> > >>> > >>>>>>> Xintong Song > >>> > >>>>>>> > >>> > >>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler < > >>> > >> ches...@apache.org <mailto:ches...@apache.org>> > >>> > >>>>>>> wrote: > >>> > >>>>>>> > >>> > >>>>>>>> Will declaring them on slot sharing groups not also > waste > >>> > >> resources > >>> > >>>>> if > >>> > >>>>>>>> the parallelism of operators within that group are > different? > >>> > >>>>>>>> > >>> > >>>>>>>> It also seems like quite a hassle for users having to > >>> > >> recalculate > >>> > >>>> the > >>> > >>>>>>>> resource requirements if they change the slot sharing. > >>> > >>>>>>>> I'd think that it's not really workable for users that > create > >>> > >> a set > >>> > >>>>> of > >>> > >>>>>>>> re-usable operators which are mixed and matched in their > >>> > >>>>> applications; > >>> > >>>>>>>> managing the resources requirements in such a setting > >>> > would be > >>> > >> a > >>> > >>>>>>>> nightmare, and in the end would require operator-level > >>> > >> requirements > >>> > >>>>> any > >>> > >>>>>>>> way. > >>> > >>>>>>>> In that sense, I'm not even sure whether it really > increases > >>> > >>>>> usability. > >>> > >>>>>>>> My main worry is that it if we wire the runtime to work > >>> > on SSGs > >>> > >>>> it's > >>> > >>>>>>>> gonna be difficult to implement more fine-grained > approaches, > >>> > >> which > >>> > >>>>>>>> would not be the case if, for the runtime, they are > always > >>> > >> defined > >>> > >>>> on > >>> > >>>>>> an > >>> > >>>>>>>> operator-level. > >>> > >>>>>>>> > >>> > >>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote: > >>> > >>>>>>>>> Thanks for drafting this FLIP and starting this > discussion > >>> > >>>> Yangze. > >>> > >>>>>>>>> I like that defining resource requirements on a slot > sharing > >>> > >>>> group > >>> > >>>>>>> makes > >>> > >>>>>>>>> the overall setup easier and improves usability of > resource > >>> > >>>>>>> requirements. 
> >>> > >>>>>>>>> What I do not like about it is that it changes slot sharing groups from being a scheduling hint to something which needs to be supported in order to support fine-grained resource requirements. So far, the idea of slot sharing groups was that it tells the system that a set of operators can be deployed in the same slot. But the system still had the freedom to say that it would rather place these tasks in different slots if it wanted. If we now specify resource requirements per slot sharing group, then the only option for a scheduler which does not support slot sharing groups is to say that every operator in this slot sharing group needs a slot with the same resources as the whole group.
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job.
> >>> > >>>>>>>>> If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
> >>> > >>>>>>>>> Originally, one of the primary goals of slot sharing groups was to make it easier for the user to reason about how many slots a job needs independent of the actual number of operators in the job. Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually. What matters is whether the whole cluster has enough resources to run all tasks or not.
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> Cheers,
> >>> > >>>>>>>>> Till
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo <karma...@gmail.com> wrote:
> >>> > >>>>>>>>>> Hi, there,
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> We would like to start a discussion thread on "FLIP-156: Runtime Interfaces for Fine-Grained Resource Requirements"[1], where we propose Slot Sharing Group (SSG) based runtime interfaces for specifying fine-grained resource requirements.
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> In this FLIP:
> >>> > >>>>>>>>>> - Expound the user story of fine-grained resource management.
> >>> > >>>>>>>>>> - Propose runtime interfaces for specifying SSG-based resource requirements.
> >>> > >>>>>>>>>> - Discuss the pros and cons of the three potential granularities for specifying the resource requirements (op, task and slot sharing group) and explain why we choose the slot sharing group.
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> Please find more details in the FLIP wiki document [1]. Looking forward to your feedback.
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> Best,
> >>> > >>>>>>>>>> Yangze Guo
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>
> >>> >
> >>> >
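Till's 100 MB example from earlier in the thread can be checked with a small toy model. The sketch below is not Flink's scheduler: the function names and the dictionary layout are made up for illustration, and only the numbers (two operators at 100 MB each, two TMs with one 100 MB slot each) come from the example. It assumes a slot hosts either the whole group or a single operator.

```python
from itertools import permutations

# Numbers from Till's example; the names and data layout are hypothetical.
OPERATOR_MB = {"op_1": 100, "op_2": 100}
SLOT_MB = {"TM_1": 100, "TM_2": 100}


def place_as_group(operators, slots):
    """SSG-level requirement: the whole group must fit into one slot."""
    group_mb = sum(operators.values())
    for slot, capacity in slots.items():
        if capacity >= group_mb:
            return {name: slot for name in operators}
    return None  # no single slot is large enough for the group


def place_per_operator(operators, slots):
    """Operator-level requirements: each operator may take its own slot."""
    names = list(operators)
    # Try every assignment of distinct slots to operators (fine for toy sizes).
    for chosen in permutations(slots, len(names)):
        if all(slots[s] >= operators[n] for n, s in zip(names, chosen)):
            return dict(zip(names, chosen))
    return None


print(place_as_group(OPERATOR_MB, SLOT_MB))
print(place_per_operator(OPERATOR_MB, SLOT_MB))
```

With these inputs the group-level placement fails because no slot offers 200 MB, while the per-operator placement puts op_1 on TM_1 and op_2 on TM_2, which is the outcome Till describes.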