Thanks everyone for the lively discussion. I'd like to summarize where the discussion has converged so far. Please let me know if I got anything wrong or missed something crucial.

Change of this FLIP:
- Treat the SSG resource requirements as a hint instead of a restriction for the runtime. That should be explicitly explained in the JavaDocs.

Potential follow-up issues if needed:
- Provide operator-level resource configuration interface.
- Provide multiple options for deciding resources for SSGs whose requirement is not specified:
    ** Default slot resource.
    ** Default operator resource times number of operators.

If there are no other issues, I'll update the FLIP accordingly and start a vote thread. Thanks all for the valuable feedback again.

Best,
Yangze Guo

On Fri, Jan 22, 2021 at 11:30 AM Xintong Song <tonysong...@gmail.com> wrote:
>
> FGRuntimeInterface.png
>
> Thank you~
>
> Xintong Song
>
>
> On Fri, Jan 22, 2021 at 11:11 AM Xintong Song <tonysong...@gmail.com> wrote:
>>
>> I think Chesnay's proposal could actually work. IIUC, the key point is to
>> derive operator requirements from SSG requirements on the API side, so that
>> the runtime only deals with operator requirements. It's debatable how the
>> deriving should be done though. E.g., an alternative could be to evenly
>> divide the SSG requirement into requirements of operators in the group.
>>
>>
>> However, I'm not entirely sure which option is more desirable. Illustrating my
>> understanding in the following figure, in which on the top is Chesnay's
>> proposal and on the bottom is the SSG-based proposal in this FLIP.
>>
>>
>>
>> I think the major difference between the two approaches is where deriving
>> operator requirements from SSG requirements happens.
>>
>> - Chesnay's proposal simplifies the runtime logic and the interface to
>> expose, at the price of moving more complexity (i.e. the deriving) to the
>> API side. The question is, where do we prefer to keep the complexity? I'm
>> slightly leaning towards having a thin API and keeping the complexity in
>> runtime if possible.
>>
>> - Notice that the dashed-line arrows represent optional steps that are needed
>> only for schedulers that do not respect SSGs, which we don't have at the
>> moment. If we only look at the solid-line arrows, then the SSG-based
>> approach is much simpler, without needing to derive and aggregate the
>> requirements back and forth. I'm not sure about complicating the current
>> design only for potential future needs.
>>
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>
>>> You're raising a good point, but I think I can rectify that with a minor
>>> adjustment.
>>>
>>> Default requirements are whatever the default requirements are, setting
>>> the requirements for one operator has no effect on other operators.
>>>
>>> With these rules, and some API enhancements, the following mockup would
>>> replicate the SSG-based behavior:
>>>
>>> Map<SlotSharingGroupId, Requirements> requirements = ...
>>> for slotSharingGroup in env.getSlotSharingGroups() {
>>>     vertices = slotSharingGroup.getVertices()
>>>     vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()))
>>>     vertices.remaining().setRequirements(ZERO)
>>> }
>>>
>>> We could even allow setting requirements on slotsharing-groups /
>>> colocation-groups and internally translate them accordingly.
>>> I can't help but feel this is a plain API issue.
>>>
>>> On 1/21/2021 9:44 AM, Till Rohrmann wrote:
>>> > If I understand you correctly Chesnay, then you want to decouple the
>>> > resource requirement specification from the slot sharing group
>>> > assignment. Hence, per default all operators would be in the same slot
>>> > sharing group. If there is no operator with a resource specification,
>>> > then the system would allocate a default slot for it. If there is at
>>> > least one operator, then the system would sum up all the specified
>>> > resources and allocate a slot of this size.
This effectively means >>> > that all unspecified operators will implicitly have a zero resource >>> > requirement. Did I understand your idea correctly? >>> > >>> > I am wondering whether this wouldn't lead to a surprising behaviour >>> > for the user. If the user specifies the resource requirements for a >>> > single operator, then he probably will assume that the other operators >>> > will get the default share of resources and not nothing. >>> > >>> > Cheers, >>> > Till >>> > >>> > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler <ches...@apache.org >>> > <mailto:ches...@apache.org>> wrote: >>> > >>> > Is there even a functional difference between specifying the >>> > requirements for an SSG vs specifying the same requirements on a >>> > single >>> > operator within that group (ideally a colocation group to avoid this >>> > whole hint business)? >>> > >>> > Wouldn't we get the best of both worlds in the latter case? >>> > >>> > Users can take shortcuts to define shared requirements, >>> > but refine them further as needed on a per-operator basis, >>> > without changing semantics of slotsharing groups >>> > nor the runtime being locked into SSG-based requirements. >>> > >>> > (And before anyone argues what happens if slotsharing groups >>> > change or >>> > whatnot, that's a plain API issue that we could surely solve. (A >>> > plain >>> > iteration over slotsharing groups and therein contained operators >>> > would >>> > suffice)). >>> > >>> > On 1/20/2021 6:48 PM, Till Rohrmann wrote: >>> > > Maybe a different minor idea: Would it be possible to treat the SSG >>> > > resource requirements as a hint for the runtime similar to how >>> > slot sharing >>> > > groups are designed at the moment? Meaning that we don't give >>> > the guarantee >>> > > that Flink will always deploy this set of tasks together no >>> > matter what >>> > > comes. 
If, for example, the runtime can derive by some means the >>> > resource >>> > > requirements for each task based on the requirements for the >>> > SSG, this >>> > > could be possible. One easy strategy would be to give every task >>> > the same >>> > > resources as the whole slot sharing group. Another one could be >>> > > distributing the resources equally among the tasks. This does >>> > not even have >>> > > to be implemented but we would give ourselves the freedom to change >>> > > scheduling if need should arise. >>> > > >>> > > Cheers, >>> > > Till >>> > > >>> > > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo <karma...@gmail.com >>> > <mailto:karma...@gmail.com>> wrote: >>> > > >>> > >> Thanks for the responses, Till and Xintong. >>> > >> >>> > >> I second Xintong's comment that SSG-based runtime interface >>> > will give >>> > >> us the flexibility to achieve op/task-based approach. That's one of >>> > >> the most important reasons for our design choice. >>> > >> >>> > >> Some cents regarding the default operator resource: >>> > >> - It might be good for the scenario of DataStream jobs. >>> > >> ** For light-weight operators, the accumulative >>> > configuration error >>> > >> will not be significant. Then, the resource of a task used is >>> > >> proportional to the number of operators it contains. >>> > >> ** For heavy operators like join and window or operators >>> > using the >>> > >> external resources, user will turn to the fine-grained resource >>> > >> configuration. >>> > >> - It can increase the stability for the standalone cluster >>> > where task >>> > >> executors registered are heterogeneous(with different default slot >>> > >> resources). >>> > >> - It might not be good for SQL users. The operators that SQL >>> > will be >>> > >> transferred to is a black box to the user. We also do not guarantee >>> > >> the cross-version of consistency of the transformation so far. 
>>> > >> >>> > >> I think it can be treated as a follow-up work when the fine-grained >>> > >> resource management is end-to-end ready. >>> > >> >>> > >> Best, >>> > >> Yangze Guo >>> > >> >>> > >> >>> > >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song >>> > <tonysong...@gmail.com <mailto:tonysong...@gmail.com>> >>> > >> wrote: >>> > >>> Thanks for the feedback, Till. >>> > >>> >>> > >>> ## I feel that what you proposed (operator-based + default >>> > value) might >>> > >> be >>> > >>> subsumed by the SSG-based approach. >>> > >>> Thinking of op_1 -> op_2, there are the following 4 cases, >>> > categorized by >>> > >>> whether the resource requirements are known to the users. >>> > >>> >>> > >>> 1. *Both known.* As previously mentioned, there's no >>> > reason to put >>> > >>> multiple operators whose individual resource requirements >>> > are already >>> > >> known >>> > >>> into the same group in fine-grained resource management. >>> > And if op_1 >>> > >> and >>> > >>> op_2 are in different groups, there should be no problem >>> > switching >>> > >> data >>> > >>> exchange mode from pipelined to blocking. This is >>> > equivalent to >>> > >> specifying >>> > >>> operator resource requirements in your proposal. >>> > >>> 2. *op_1 known, op_2 unknown.* Similar to 1), except that >>> > op_2 is in a >>> > >>> SSG whose resource is not specified thus would have the >>> > default slot >>> > >>> resource. This is equivalent to having default operator >>> > resources in >>> > >> your >>> > >>> proposal. >>> > >>> 3. *Both unknown*. The user can either set op_1 and op_2 >>> > to the same >>> > >> SSG >>> > >>> or separate SSGs. >>> > >>> - If op_1 and op_2 are in the same SSG, it will be >>> > equivalent to >>> > >> the >>> > >>> coarse-grained resource management, where op_1 and op_2 >>> > share a >>> > >> default >>> > >>> size slot no matter which data exchange mode is used. 
>>> > >>> - If op_1 and op_2 are in different SSGs, then each of >>> > them will >>> > >> use >>> > >>> a default size slot. This is equivalent to setting them >>> > with >>> > >> default >>> > >>> operator resources in your proposal. >>> > >>> 4. *Total (pipeline) or max (blocking) of op_1 and op_2 is >>> > known.* >>> > >>> - It is possible that the user learns the total / max >>> > resource >>> > >>> requirement from executing and monitoring the job, >>> > while not >>> > >>> being aware of >>> > >>> individual operator requirements. >>> > >>> - I believe this is the case your proposal does not >>> > cover. And TBH, >>> > >>> this is probably how most users learn the resource >>> > requirements, >>> > >>> according >>> > >>> to my experiences. >>> > >>> - In this case, the user might need to specify >>> > different resources >>> > >> if >>> > >>> he wants to switch the execution mode, which should not >>> > be worse >>> > >> than not >>> > >>> being able to use fine-grained resource management. >>> > >>> >>> > >>> >>> > >>> ## An additional idea inspired by your proposal. >>> > >>> We may provide multiple options for deciding resources for >>> > SSGs whose >>> > >>> requirement is not specified, if needed. >>> > >>> >>> > >>> - Default slot resource (current design) >>> > >>> - Default operator resource times number of operators >>> > (equivalent to >>> > >>> your proposal) >>> > >>> >>> > >>> >>> > >>> ## Exposing internal runtime strategies >>> > >>> Theoretically, yes. Tying to the SSGs, the resource >>> > requirements might be >>> > >>> affected if how SSGs are internally handled changes in future. >>> > >> Practically, >>> > >>> I do not concretely see at the moment what kind of changes we >>> > may want in >>> > >>> future that might conflict with this FLIP proposal, as the >>> > question of >>> > >>> switching data exchange mode answered above. 
I'd suggest to >>> > not give up >>> > >> the >>> > >>> user friendliness we may gain now for the future problems that >>> > may or may >>> > >>> not exist. >>> > >>> >>> > >>> Moreover, the SSG-based approach has the flexibility to >>> > achieve the >>> > >>> equivalent behavior as the operator-based approach, if we set each >>> > >> operator >>> > >>> (or task) to a separate SSG. We can even provide a shortcut >>> > option to >>> > >>> automatically do that for users, if needed. >>> > >>> >>> > >>> >>> > >>> Thank you~ >>> > >>> >>> > >>> Xintong Song >>> > >>> >>> > >>> >>> > >>> >>> > >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann >>> > <trohrm...@apache.org <mailto:trohrm...@apache.org>> >>> > >> wrote: >>> > >>>> Thanks for the responses Xintong and Stephan, >>> > >>>> >>> > >>>> I agree that being able to define the resource requirements for a >>> > >> group of >>> > >>>> operators is more user friendly. However, my concern is that >>> > we are >>> > >>>> exposing thereby internal runtime strategies which might >>> > limit our >>> > >>>> flexibility to execute a given job. Moreover, the semantics of >>> > >> configuring >>> > >>>> resource requirements for SSGs could break if switching from >>> > streaming >>> > >> to >>> > >>>> batch execution. If one defines the resource requirements for >>> > op_1 -> >>> > >> op_2 >>> > >>>> which run in pipelined mode when using the streaming >>> > execution, then >>> > >> how do >>> > >>>> we interpret these requirements when op_1 -> op_2 are >>> > executed with a >>> > >>>> blocking data exchange in batch execution mode? Consequently, >>> > I am >>> > >> still >>> > >>>> leaning towards Stephan's proposal to set the resource >>> > requirements per >>> > >>>> operator. 
>>> > >>>> >>> > >>>> Maybe the following proposal makes the configuration easier: >>> > If the >>> > >> user >>> > >>>> wants to use fine-grained resource requirements, then she >>> > needs to >>> > >> specify >>> > >>>> the default size which is used for operators which have no >>> > explicit >>> > >>>> resource annotation. If this holds true, then every operator >>> > would >>> > >> have a >>> > >>>> resource requirement and the system can try to execute the >>> > operators >>> > >> in the >>> > >>>> best possible manner w/o being constrained by how the user >>> > set the SSG >>> > >>>> requirements. >>> > >>>> >>> > >>>> Cheers, >>> > >>>> Till >>> > >>>> >>> > >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song >>> > <tonysong...@gmail.com <mailto:tonysong...@gmail.com>> >>> > >>>> wrote: >>> > >>>> >>> > >>>>> Thanks for the feedback, Stephan. >>> > >>>>> >>> > >>>>> Actually, your proposal has also come to my mind at some >>> > point. And I >>> > >>>> have >>> > >>>>> some concerns about it. >>> > >>>>> >>> > >>>>> >>> > >>>>> 1. It does not give users the same control as the SSG-based >>> > approach. >>> > >>>>> >>> > >>>>> >>> > >>>>> While both approaches do not require specifying for each >>> > operator, >>> > >>>>> SSG-based approach supports the semantic that "some operators >>> > >> together >>> > >>>> use >>> > >>>>> this much resource" while the operator-based approach doesn't. >>> > >>>>> >>> > >>>>> >>> > >>>>> Think of a long pipeline with m operators (o_1, o_2, ..., >>> > o_m), and >>> > >> at >>> > >>>> some >>> > >>>>> point there's an agg o_n (1 < n < m) which significantly >>> > reduces the >>> > >> data >>> > >>>>> amount. One can separate the pipeline into 2 groups SSG_1 >>> > (o_1, ..., >>> > >> o_n) >>> > >>>>> and SSG_2 (o_n+1, ... o_m), so that configuring much higher >>> > >> parallelisms >>> > >>>>> for operators in SSG_1 than for operators in SSG_2 won't >>> > lead to too >>> > >> much >>> > >>>>> wasting of resources. 
If the two SSGs end up needing different resources,
>>> > >>>>> with the SSG-based approach one can directly specify resources for the two
>>> > >>>>> groups. However, with the operator-based approach, the user will have to
>>> > >>>>> specify resources for each operator in one of the two groups, and tune the
>>> > >>>>> default slot resource via configurations to fit the other group.
>>> > >>>>>
>>> > >>>>>
>>> > >>>>> 2. It increases the chance of breaking operator chains.
>>> > >>>>>
>>> > >>>>>
>>> > >>>>> Setting chainable operators into different slot sharing groups will
>>> > >>>>> prevent them from being chained. In the current implementation, downstream
>>> > >>>>> operators, if SSG not explicitly specified, will be set to the same group
>>> > >>>>> as the chainable upstream operators (unless multiple upstream operators are in
>>> > >>>>> different groups), to reduce the chance of breaking chains.
>>> > >>>>>
>>> > >>>>>
>>> > >>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4, deciding SSGs
>>> > >>>>> based on whether resource is specified we will easily get groups like (o_1,
>>> > >>>>> o_3) & (o_2, o_4), where none of the operators can be chained. This is also
>>> > >>>>> possible for the SSG-based approach, but I believe the chance is much
>>> > >>>>> smaller because there's no strong reason for users to specify the groups
>>> > >>>>> with alternate operators like that. We are more likely to get groups like
>>> > >>>>> (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 and o_3.
>>> > >>>>>
>>> > >>>>> 3. It complicates the system by having two different mechanisms for sharing
>>> > >>>>> managed memory in a slot.
>>> > >>>>> >>> > >>>>> >>> > >>>>> - In FLIP-141, we introduced the intra-slot managed memory >>> > sharing >>> > >>>>> mechanism, where managed memory is first distributed >>> > according to the >>> > >>>>> consumer type, then further distributed across operators of that >>> > >> consumer >>> > >>>>> type. >>> > >>>>> >>> > >>>>> - With the operator-based approach, managed memory size >>> > specified >>> > >> for an >>> > >>>>> operator should account for all the consumer types of that >>> > operator. >>> > >> That >>> > >>>>> means the managed memory is first distributed across >>> > operators, then >>> > >>>>> distributed to different consumer types of each operator. >>> > >>>>> >>> > >>>>> >>> > >>>>> Unfortunately, the different order of the two calculation >>> > steps can >>> > >> lead >>> > >>>> to >>> > >>>>> different results. To be specific, the semantic of the >>> > configuration >>> > >>>> option >>> > >>>>> `consumer-weights` changed (within a slot vs. within an >>> > operator). >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> > >>>>> To sum up things: >>> > >>>>> >>> > >>>>> While (3) might be a bit more implementation related, I >>> > think (1) >>> > >> and (2) >>> > >>>>> somehow suggest that, the price for the proposed approach to >>> > avoid >>> > >>>>> specifying resource for every operator is that it's not as >>> > >> independent >>> > >>>> from >>> > >>>>> operator chaining and slot sharing as the operator-based >>> > approach >>> > >>>> discussed >>> > >>>>> in the FLIP. >>> > >>>>> >>> > >>>>> >>> > >>>>> Thank you~ >>> > >>>>> >>> > >>>>> Xintong Song >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> > >>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen >>> > <se...@apache.org <mailto:se...@apache.org>> >>> > >> wrote: >>> > >>>>>> Thanks a lot, Yangze and Xintong for this FLIP. >>> > >>>>>> >>> > >>>>>> I want to say, first of all, that this is super well >>> > written. 
And >>> > >> the >>> > >>>>>> points that the FLIP makes about how to expose the >>> > configuration to >>> > >>>> users >>> > >>>>>> is exactly the right thing to figure out first. >>> > >>>>>> So good job here! >>> > >>>>>> >>> > >>>>>> About how to let users specify the resource profiles. If I >>> > can sum >>> > >> the >>> > >>>>> FLIP >>> > >>>>>> and previous discussion up in my own words, the problem is the >>> > >>>> following: >>> > >>>>>> Operator-level specification is the simplest and cleanest >>> > approach, >>> > >>>>> because >>> > >>>>>>> it avoids mixing operator configuration (resource) and >>> > >> scheduling. No >>> > >>>>>>> matter what other parameters change (chaining, slot sharing, >>> > >>>> switching >>> > >>>>>>> pipelined and blocking shuffles), the resource profiles >>> > stay the >>> > >>>> same. >>> > >>>>>>> But it would require that a user specifies resources on all >>> > >>>> operators, >>> > >>>>>>> which makes it hard to use. That's why the FLIP suggests going >>> > >> with >>> > >>>>>>> specifying resources on a Sharing-Group. >>> > >>>>>> >>> > >>>>>> I think both thoughts are important, so can we find a solution >>> > >> where >>> > >>>> the >>> > >>>>>> Resource Profiles are specified on an Operator, but we >>> > still avoid >>> > >> that >>> > >>>>> we >>> > >>>>>> need to specify a resource profile on every operator? >>> > >>>>>> >>> > >>>>>> What do you think about something like the following: >>> > >>>>>> - Resource Profiles are specified on an operator level. >>> > >>>>>> - Not all operators need profiles >>> > >>>>>> - All Operators without a Resource Profile ended up in the >>> > >> default >>> > >>>> slot >>> > >>>>>> sharing group with a default profile (will get a default slot). >>> > >>>>>> - All Operators with a Resource Profile will go into >>> > another slot >>> > >>>>> sharing >>> > >>>>>> group (the resource-specified-group). 
>>> > >>>>>> - Users can define different slot sharing groups for >>> > operators >>> > >> like >>> > >>>>> they >>> > >>>>>> do now, with the exception that you cannot mix operators >>> > that have >>> > >> a >>> > >>>>>> resource profile and operators that have no resource profile. >>> > >>>>>> - The default case where no operator has a resource >>> > profile is >>> > >> just a >>> > >>>>>> special case of this model >>> > >>>>>> - The chaining logic sums up the profiles per operator, >>> > like it >>> > >> does >>> > >>>>> now, >>> > >>>>>> and the scheduler sums up the profiles of the tasks that it >>> > >> schedules >>> > >>>>>> together. >>> > >>>>>> >>> > >>>>>> >>> > >>>>>> There is another question about reactive scaling raised in the >>> > >> FLIP. I >>> > >>>>> need >>> > >>>>>> to think a bit about that. That is indeed a bit more tricky >>> > once we >>> > >>>> have >>> > >>>>>> slots of different sizes. >>> > >>>>>> It is not clear then which of the different slot requests the >>> > >>>>>> ResourceManager should fulfill when new resources (TMs) >>> > show up, >>> > >> or how >>> > >>>>> the >>> > >>>>>> JobManager redistributes the slots resources when resources >>> > (TMs) >>> > >>>>> disappear >>> > >>>>>> This question is pretty orthogonal, though, to the "how to >>> > specify >>> > >> the >>> > >>>>>> resources". >>> > >>>>>> >>> > >>>>>> >>> > >>>>>> Best, >>> > >>>>>> Stephan >>> > >>>>>> >>> > >>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song >>> > <tonysong...@gmail.com <mailto:tonysong...@gmail.com> >>> > >>>>> wrote: >>> > >>>>>>> Thanks for drafting the FLIP and driving the discussion, >>> > Yangze. >>> > >>>>>>> And Thanks for the feedback, Till and Chesnay. 
>>> > >>>>>>> >>> > >>>>>>> @Till, >>> > >>>>>>> >>> > >>>>>>> I agree that specifying requirements for SSGs means that SSGs >>> > >> need to >>> > >>>>> be >>> > >>>>>>> supported in fine-grained resource management, otherwise each >>> > >>>> operator >>> > >>>>>>> might use as many resources as the whole group. However, I >>> > cannot >>> > >>>> think >>> > >>>>>> of >>> > >>>>>>> a strong reason for not supporting SSGs in fine-grained >>> > resource >>> > >>>>>>> management. >>> > >>>>>>> >>> > >>>>>>> >>> > >>>>>>>> Interestingly, if all operators have their resources properly >>> > >>>>>> specified, >>> > >>>>>>>> then slot sharing is no longer needed because Flink could >>> > >> slice off >>> > >>>>> the >>> > >>>>>>>> appropriately sized slots for every Task individually. >>> > >>>>>>>> >>> > >>>>>>> So for example, if we have a job consisting of two >>> > operator op_1 >>> > >> and >>> > >>>>> op_2 >>> > >>>>>>>> where each op needs 100 MB of memory, we would then say that >>> > >> the >>> > >>>> slot >>> > >>>>>>>> sharing group needs 200 MB of memory to run. If we have a >>> > >> cluster >>> > >>>>> with >>> > >>>>>> 2 >>> > >>>>>>>> TMs with one slot of 100 MB each, then the system cannot run >>> > >> this >>> > >>>>> job. >>> > >>>>>> If >>> > >>>>>>>> the resources were specified on an operator level, then the >>> > >> system >>> > >>>>>> could >>> > >>>>>>>> still make the decision to deploy op_1 to TM_1 and op_2 to >>> > >> TM_2. >>> > >>>>>>> >>> > >>>>>>> Couldn't agree more that if all operators' requirements are >>> > >> properly >>> > >>>>>>> specified, slot sharing should be no longer needed. I >>> > think this >>> > >>>>> exactly >>> > >>>>>>> disproves the example. If we already know op_1 and op_2 each >>> > >> needs >>> > >>>> 100 >>> > >>>>> MB >>> > >>>>>>> of memory, why would we put them in the same group? 
If >>> > they are >>> > >> in >>> > >>>>>> separate >>> > >>>>>>> groups, with the proposed approach the system can freely >>> > deploy >>> > >> them >>> > >>>> to >>> > >>>>>>> either a 200 MB TM or two 100 MB TMs. >>> > >>>>>>> >>> > >>>>>>> Moreover, the precondition for not needing slot sharing is >>> > having >>> > >>>>>> resource >>> > >>>>>>> requirements properly specified for all operators. This is not >>> > >> always >>> > >>>>>>> possible, and usually requires tremendous efforts. One of the >>> > >>>> benefits >>> > >>>>>> for >>> > >>>>>>> SSG-based requirements is that it allows the user to freely >>> > >> decide >>> > >>>> the >>> > >>>>>>> granularity, thus efforts they want to pay. I would >>> > consider SSG >>> > >> in >>> > >>>>>>> fine-grained resource management as a group of operators >>> > that the >>> > >>>> user >>> > >>>>>>> would like to specify the total resource for. There can be >>> > only >>> > >> one >>> > >>>>> group >>> > >>>>>>> in the job, 2~3 groups dividing the job into a few major >>> > parts, >>> > >> or as >>> > >>>>>> many >>> > >>>>>>> groups as the number of tasks/operators, depending on how >>> > >>>> fine-grained >>> > >>>>>> the >>> > >>>>>>> user is able to specify the resources. >>> > >>>>>>> >>> > >>>>>>> Having to support SSGs might be a constraint. But given >>> > that all >>> > >> the >>> > >>>>>>> current scheduler implementations already support SSGs, I >>> > tend to >>> > >>>> think >>> > >>>>>>> that as an acceptable price for the above discussed >>> > usability and >>> > >>>>>>> flexibility. >>> > >>>>>>> >>> > >>>>>>> @Chesnay >>> > >>>>>>> >>> > >>>>>>> Will declaring them on slot sharing groups not also waste >>> > >> resources >>> > >>>> if >>> > >>>>>> the >>> > >>>>>>>> parallelism of operators within that group are different? >>> > >>>>>>>> >>> > >>>>>>> Yes. It's a trade-off between usability and resource >>> > >> utilization. 
To >>> > >>>>>> avoid >>> > >>>>>>> such wasting, the user can define more groups, so that >>> > each group >>> > >>>>>> contains >>> > >>>>>>> less operators and the chance of having operators with >>> > different >>> > >>>>>>> parallelism will be reduced. The price is to have more >>> > resource >>> > >>>>>>> requirements to specify. >>> > >>>>>>> >>> > >>>>>>> It also seems like quite a hassle for users having to >>> > >> recalculate the >>> > >>>>>>>> resource requirements if they change the slot sharing. >>> > >>>>>>>> I'd think that it's not really workable for users that create >>> > >> a set >>> > >>>>> of >>> > >>>>>>>> re-usable operators which are mixed and matched in their >>> > >>>>> applications; >>> > >>>>>>>> managing the resources requirements in such a setting >>> > would be >>> > >> a >>> > >>>>>>>> nightmare, and in the end would require operator-level >>> > >> requirements >>> > >>>>> any >>> > >>>>>>>> way. >>> > >>>>>>>> In that sense, I'm not even sure whether it really increases >>> > >>>>> usability. >>> > >>>>>>> - As mentioned in my reply to Till's comment, there's no >>> > >> reason to >>> > >>>>> put >>> > >>>>>>> multiple operators whose individual resource >>> > requirements are >>> > >>>>> already >>> > >>>>>>> known >>> > >>>>>>> into the same group in fine-grained resource management. >>> > >>>>>>> - Even an operator implementation is reused for multiple >>> > >>>>> applications, >>> > >>>>>>> it does not guarantee the same resource requirements. >>> > During >>> > >> our >>> > >>>>> years >>> > >>>>>>> of >>> > >>>>>>> practices in Alibaba, with per-operator requirements >>> > >> specified for >>> > >>>>>>> Blink's >>> > >>>>>>> fine-grained resource management, very few users >>> > (including >>> > >> our >>> > >>>>>>> specialists >>> > >>>>>>> who are dedicated to supporting Blink users) are as >>> > >> experienced as >>> > >>>>> to >>> > >>>>>>> accurately predict/estimate the operator resource >>> > >> requirements. 
>>> > >>>> Most >>> > >>>>>>> people >>> > >>>>>>> rely on the execution-time metrics (throughput, delay, cpu >>> > >> load, >>> > >>>>>> memory >>> > >>>>>>> usage, GC pressure, etc.) to improve the specification. >>> > >>>>>>> >>> > >>>>>>> To sum up: >>> > >>>>>>> If the user is capable of providing proper resource >>> > requirements >>> > >> for >>> > >>>>>> every >>> > >>>>>>> operator, that's definitely a good thing and we would not >>> > need to >>> > >>>> rely >>> > >>>>> on >>> > >>>>>>> the SSGs. However, that shouldn't be a *must* for the >>> > >> fine-grained >>> > >>>>>> resource >>> > >>>>>>> management to work. For those users who are capable and do not >>> > >> like >>> > >>>>>> having >>> > >>>>>>> to set each operator to a separate SSG, I would be ok to have >>> > >> both >>> > >>>>>>> SSG-based and operator-based runtime interfaces and to only >>> > >> fallback >>> > >>>> to >>> > >>>>>> the >>> > >>>>>>> SSG requirements when the operator requirements are not >>> > >> specified. >>> > >>>>>> However, >>> > >>>>>>> as the first step, I think we should prioritise the use cases >>> > >> where >>> > >>>>> users >>> > >>>>>>> are not that experienced. >>> > >>>>>>> >>> > >>>>>>> Thank you~ >>> > >>>>>>> >>> > >>>>>>> Xintong Song >>> > >>>>>>> >>> > >>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler < >>> > >> ches...@apache.org <mailto:ches...@apache.org>> >>> > >>>>>>> wrote: >>> > >>>>>>> >>> > >>>>>>>> Will declaring them on slot sharing groups not also waste >>> > >> resources >>> > >>>>> if >>> > >>>>>>>> the parallelism of operators within that group are different? >>> > >>>>>>>> >>> > >>>>>>>> It also seems like quite a hassle for users having to >>> > >> recalculate >>> > >>>> the >>> > >>>>>>>> resource requirements if they change the slot sharing. 
>>> > >>>>>>>> I'd think that it's not really workable for users that create >>> > >> a set >>> > >>>>> of >>> > >>>>>>>> re-usable operators which are mixed and matched in their >>> > >>>>> applications; >>> > >>>>>>>> managing the resources requirements in such a setting >>> > would be >>> > >> a >>> > >>>>>>>> nightmare, and in the end would require operator-level >>> > >> requirements >>> > >>>>> any >>> > >>>>>>>> way. >>> > >>>>>>>> In that sense, I'm not even sure whether it really increases >>> > >>>>> usability. >>> > >>>>>>>> My main worry is that it if we wire the runtime to work >>> > on SSGs >>> > >>>> it's >>> > >>>>>>>> gonna be difficult to implement more fine-grained approaches, >>> > >> which >>> > >>>>>>>> would not be the case if, for the runtime, they are always >>> > >> defined >>> > >>>> on >>> > >>>>>> an >>> > >>>>>>>> operator-level. >>> > >>>>>>>> >>> > >>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote: >>> > >>>>>>>>> Thanks for drafting this FLIP and starting this discussion >>> > >>>> Yangze. >>> > >>>>>>>>> I like that defining resource requirements on a slot sharing >>> > >>>> group >>> > >>>>>>> makes >>> > >>>>>>>>> the overall setup easier and improves usability of resource >>> > >>>>>>> requirements. >>> > >>>>>>>>> What I do not like about it is that it changes slot sharing >>> > >>>> groups >>> > >>>>>> from >>> > >>>>>>>>> being a scheduling hint to something which needs to be >>> > >> supported >>> > >>>> in >>> > >>>>>>> order >>> > >>>>>>>>> to support fine grained resource requirements. So far, the >>> > >> idea >>> > >>>> of >>> > >>>>>> slot >>> > >>>>>>>>> sharing groups was that it tells the system that a set of >>> > >>>> operators >>> > >>>>>> can >>> > >>>>>>>> be >>> > >>>>>>>>> deployed in the same slot. But the system still had the >>> > >> freedom >>> > >>>> to >>> > >>>>>> say >>> > >>>>>>>> that >>> > >>>>>>>>> it would rather place these tasks in different slots if it >>> > >>>> wanted. 
>>> If we now specify resource requirements per slot sharing group, then the only option for a scheduler which does not support slot sharing groups is to say that every operator in this slot sharing group needs a slot with the same resources as the whole group.
>>>
>>> So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
>>> Originally, one of the primary goals of slot sharing groups was to make it easier for the user to reason about how many slots a job needs independent of the actual number of operators in the job. Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually. What matters is whether the whole cluster has enough resources to run all tasks or not.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo <karma...@gmail.com> wrote:
>>>> Hi, there,
>>>>
>>>> We would like to start a discussion thread on "FLIP-156: Runtime Interfaces for Fine-Grained Resource Requirements"[1], where we propose Slot Sharing Group (SSG) based runtime interfaces for specifying fine-grained resource requirements.
>>>>
>>>> In this FLIP:
>>>> - Expound the user story of fine-grained resource management.
>>>> - Propose runtime interfaces for specifying SSG-based resource requirements.
>>>> - Discuss the pros and cons of the three potential granularities for specifying the resource requirements (op, task and slot sharing group) and explain why we choose the slot sharing group.
>>>>
>>>> Please find more details in the FLIP wiki document [1]. Looking forward to your feedback.
>>>>
>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
>>>>
>>>> Best,
>>>> Yangze Guo
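
For readers following the thread, Till's op_1/op_2 example above can be worked through in a few lines. This is only a rough sketch and not Flink code: the names Operator, fits_as_group and place_per_operator are made up for illustration, memory is the only resource considered, and the operator-level placement uses a simple first-fit heuristic rather than whatever a real scheduler would do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """Hypothetical operator with a memory requirement (illustration only)."""
    name: str
    memory_mb: int


def fits_as_group(operators, slot_sizes_mb):
    """SSG requirement treated as a restriction: a single slot must be
    large enough to hold the summed requirement of the whole group."""
    group_mb = sum(op.memory_mb for op in operators)
    return any(size >= group_mb for size in slot_sizes_mb)


def place_per_operator(operators, slot_sizes_mb):
    """Operator-level requirements: place each operator into the first slot
    with enough remaining memory (first-fit, largest operators first).
    Returns {operator name: slot index}, or None if some operator does not fit."""
    free = list(slot_sizes_mb)
    placement = {}
    for op in sorted(operators, key=lambda o: o.memory_mb, reverse=True):
        for slot, capacity in enumerate(free):
            if capacity >= op.memory_mb:
                free[slot] -= op.memory_mb
                placement[op.name] = slot
                break
        else:
            return None  # no slot can host this operator
    return placement


# Till's scenario: two operators of 100 MB each, two TMs with one 100 MB slot each.
operators = [Operator("op_1", 100), Operator("op_2", 100)]
tm_slots_mb = [100, 100]

print(fits_as_group(operators, tm_slots_mb))       # False: no slot offers 200 MB
print(place_per_operator(operators, tm_slots_mb))  # {'op_1': 0, 'op_2': 1}
```

The group-level check fails because no single slot offers 200 MB, while the per-operator placement succeeds by putting op_1 on the first TM and op_2 on the second, which is the freedom Till argues the runtime loses if the SSG requirement is a hard restriction.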