Re: [DISCUSS] FLIP-156: Runtime Interfaces for Fine-Grained Resource Requirements

Yangze Guo Mon, 25 Jan 2021 19:08:56 -0800

Thanks everyone for the lively discussion. I'd like to try to
summarize the current convergence in the discussion. Please let me
know if I got things wrong or missed something crucial here.


Change of this FLIP:
- Treat the SSG resource requirements as a hint instead of a
restriction for the runtime. That should be explicitly explained in
the JavaDocs.
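
To make the "hint" semantics concrete, here is a minimal sketch of how a
runtime could derive per-task requirements from an SSG requirement, using
the two strategies mentioned further down in this thread (every task gets
the whole group's resources, or the group's resources are divided equally).
This is only an illustration and not part of the FLIP: SsgHintSketch,
Requirement, sameAsGroup and splitEvenly are hypothetical names, not Flink
APIs, and the resource model is reduced to CPU cores and memory.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names are Flink APIs. */
final class SsgHintSketch {

    /** Simplified resource requirement: CPU cores and memory in MB. */
    static final class Requirement {
        final double cpuCores;
        final int memoryMb;

        Requirement(double cpuCores, int memoryMb) {
            this.cpuCores = cpuCores;
            this.memoryMb = memoryMb;
        }
    }

    /** Strategy A: every task is given the same resources as the whole SSG. */
    static List<Requirement> sameAsGroup(Requirement group, int numTasks) {
        checkPositive(numTasks);
        List<Requirement> perTask = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            perTask.add(group);
        }
        return perTask;
    }

    /**
     * Strategy B: the SSG requirement is divided equally among the tasks.
     * Memory uses integer division, so any remainder is dropped.
     */
    static List<Requirement> splitEvenly(Requirement group, int numTasks) {
        checkPositive(numTasks);
        List<Requirement> perTask = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            perTask.add(new Requirement(group.cpuCores / numTasks, group.memoryMb / numTasks));
        }
        return perTask;
    }

    private static void checkPositive(int numTasks) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
    }

    public static void main(String[] args) {
        Requirement ssg = new Requirement(2.0, 400);
        System.out.println(sameAsGroup(ssg, 4).get(0).memoryMb); // prints 400
        System.out.println(splitEvenly(ssg, 4).get(0).memoryMb); // prints 100
    }
}
```

Neither strategy would have to be implemented now; the point is only that a
scheduler which does not respect SSGs could still make use of the hint.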

Potential follow-up issues if needed:
- Provide operator-level resource configuration interface.
- Provide multiple options for deciding resources for SSGs whose
requirement is not specified:
    ** Default slot resource.
    ** Default operator resource times number of operators.
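
The two options above could be sketched roughly as follows. This is a
simplified illustration under assumptions of my own: the names
DefaultSsgResourceSketch, FallbackStrategy and resolveMemoryMb are
hypothetical and do not exist in Flink, and only a single memory dimension
is modelled.

```java
import java.util.OptionalInt;

/** Hypothetical sketch; none of these names are Flink APIs. */
final class DefaultSsgResourceSketch {

    /** How to size an SSG whose requirement is not specified. */
    enum FallbackStrategy {
        DEFAULT_SLOT_RESOURCE,
        DEFAULT_OPERATOR_RESOURCE_TIMES_OPERATORS
    }

    /** Returns the memory (MB) to request for one slot sharing group. */
    static int resolveMemoryMb(
            OptionalInt specifiedMb,
            FallbackStrategy strategy,
            int defaultSlotMb,
            int defaultOperatorMb,
            int numOperators) {
        // An explicitly specified SSG requirement always wins.
        if (specifiedMb.isPresent()) {
            return specifiedMb.getAsInt();
        }
        switch (strategy) {
            case DEFAULT_SLOT_RESOURCE:
                return defaultSlotMb;
            case DEFAULT_OPERATOR_RESOURCE_TIMES_OPERATORS:
                return defaultOperatorMb * numOperators;
            default:
                throw new IllegalArgumentException("Unknown strategy: " + strategy);
        }
    }

    public static void main(String[] args) {
        // Unspecified SSG with 3 operators, default slot 1024 MB, default operator 128 MB.
        System.out.println(resolveMemoryMb(OptionalInt.empty(),
                FallbackStrategy.DEFAULT_SLOT_RESOURCE, 1024, 128, 3)); // prints 1024
        System.out.println(resolveMemoryMb(OptionalInt.empty(),
                FallbackStrategy.DEFAULT_OPERATOR_RESOURCE_TIMES_OPERATORS, 1024, 128, 3)); // prints 384
        // Specified SSG: the fallback strategy is ignored.
        System.out.println(resolveMemoryMb(OptionalInt.of(512),
                FallbackStrategy.DEFAULT_SLOT_RESOURCE, 1024, 128, 3)); // prints 512
    }
}
```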

If there are no other issues, I'll update the FLIP accordingly and
start a vote thread. Thanks all for the valuable feedback again.

Best,
Yangze Guo



On Fri, Jan 22, 2021 at 11:30 AM Xintong Song <[email protected]> wrote:
>
>
>  FGRuntimeInterface.png
>
> Thank you~
>
> Xintong Song
>
>
>
> On Fri, Jan 22, 2021 at 11:11 AM Xintong Song <[email protected]> wrote:
>>
>> I think Chesnay's proposal could actually work. IIUC, the key point is to 
>> derive operator requirements from SSG requirements on the API side, so that 
>> the runtime only deals with operator requirements. It's debatable how the 
>> deriving should be done though. E.g., an alternative could be to evenly 
>> divide the SSG requirement into requirements of operators in the group.
>>
>>
>> However, I'm not entirely sure which option is more desired. Illustrating my 
>> understanding in the following figure, in which on the top is Chesnay's 
>> proposal and on the bottom is the SSG-based proposal in this FLIP.
>>
>>
>>
>> I think the major difference between the two approaches is where deriving 
>> operator requirements from SSG requirements happens.
>>
>> - Chesnay's proposal simplifies the runtime logic and the interface to 
>> expose, at the price of moving more complexity (i.e. the deriving) to the 
>> API side. The question is, where do we prefer to keep the complexity? I'm 
>> slightly leaning towards having a thin API and keeping the complexity in 
>> runtime if possible.
>>
>> - Notice that the dashed-line arrows represent optional steps that are needed 
>> only for schedulers that do not respect SSGs, which we don't have at the 
>> moment. If we only look at the solid line arrows, then the SSG-based 
>> approach is much simpler, without needing to derive and aggregate the 
>> requirements back and forth. I'm not sure about complicating the current 
>> design only for the potential future needs.
>>
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>>
>>
>> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler <[email protected]> wrote:
>>>
>>> You're raising a good point, but I think I can rectify that with a minor
>>> adjustment.
>>>
>>> Default requirements are whatever the default requirements are; setting
>>> the requirements for one operator has no effect on other operators.
>>>
>>> With these rules, and some API enhancements, the following mockup would
>>> replicate the SSG-based behavior:
>>>
>>> Map<SlotSharingGroupId, Requirements> requirements = ...
>>> for slotSharingGroup in env.getSlotSharingGroups() {
>>>      vertices = slotSharingGroup.getVertices()
>>>      vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()))
>>>      vertices.remaining().setRequirements(ZERO)
>>> }
>>>
>>> We could even allow setting requirements on slot sharing groups or
>>> colocation groups and internally translate them accordingly.
>>> I can't help but feel this is a plain API issue.
>>>
>>> On 1/21/2021 9:44 AM, Till Rohrmann wrote:
>>> > If I understand you correctly Chesnay, then you want to decouple the
>>> > resource requirement specification from the slot sharing group
>>> > assignment. Hence, per default all operators would be in the same slot
>>> > sharing group. If there is no operator with a resource specification,
>>> > then the system would allocate a default slot for it. If there is at
>>> > least one operator, then the system would sum up all the specified
>>> > resources and allocate a slot of this size. This effectively means
>>> > that all unspecified operators will implicitly have a zero resource
>>> > requirement. Did I understand your idea correctly?
>>> >
>>> > I am wondering whether this wouldn't lead to a surprising behaviour
>>> > for the user. If the user specifies the resource requirements for a
>>> > single operator, then he probably will assume that the other operators
>>> > will get the default share of resources and not nothing.
>>> >
>>> > Cheers,
>>> > Till
>>> >
>>> > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler <[email protected]
>>> > <mailto:[email protected]>> wrote:
>>> >
>>> >     Is there even a functional difference between specifying the
>>> >     requirements for an SSG vs specifying the same requirements on a
>>> >     single operator within that group (ideally a colocation group to
>>> >     avoid this whole hint business)?
>>> >
>>> >     Wouldn't we get the best of both worlds in the latter case?
>>> >
>>> >     Users can take shortcuts to define shared requirements,
>>> >     but refine them further as needed on a per-operator basis,
>>> >     without changing semantics of slotsharing groups
>>> >     nor the runtime being locked into SSG-based requirements.
>>> >
>>> >     (And before anyone argues what happens if slotsharing groups
>>> >     change or whatnot, that's a plain API issue that we could surely
>>> >     solve. (A plain iteration over slotsharing groups and therein
>>> >     contained operators would suffice)).
>>> >
>>> >     On 1/20/2021 6:48 PM, Till Rohrmann wrote:
>>> >     > Maybe a different minor idea: Would it be possible to treat the SSG
>>> >     > resource requirements as a hint for the runtime similar to how slot sharing
>>> >     > groups are designed at the moment? Meaning that we don't give the guarantee
>>> >     > that Flink will always deploy this set of tasks together no matter what
>>> >     > comes. If, for example, the runtime can derive by some means the resource
>>> >     > requirements for each task based on the requirements for the SSG, this
>>> >     > could be possible. One easy strategy would be to give every task the same
>>> >     > resources as the whole slot sharing group. Another one could be
>>> >     > distributing the resources equally among the tasks. This does not even have
>>> >     > to be implemented but we would give ourselves the freedom to change
>>> >     > scheduling if need should arise.
>>> >     >
>>> >     > Cheers,
>>> >     > Till
>>> >     >
>>> >     > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo <[email protected]
>>> >     <mailto:[email protected]>> wrote:
>>> >     >
>>> >     >> Thanks for the responses, Till and Xintong.
>>> >     >>
>>> >     >> I second Xintong's comment that the SSG-based runtime interface will
>>> >     >> give us the flexibility to achieve the op/task-based approach. That's
>>> >     >> one of the most important reasons for our design choice.
>>> >     >>
>>> >     >> Some cents regarding the default operator resource:
>>> >     >> - It might be good for the scenario of DataStream jobs.
>>> >     >>     ** For light-weight operators, the accumulative configuration error
>>> >     >> will not be significant. Then, the resource a task uses is
>>> >     >> proportional to the number of operators it contains.
>>> >     >>     ** For heavy operators like join and window, or operators using
>>> >     >> external resources, users will turn to the fine-grained resource
>>> >     >> configuration.
>>> >     >> - It can increase the stability for the standalone cluster where the
>>> >     >> registered task executors are heterogeneous (with different default
>>> >     >> slot resources).
>>> >     >> - It might not be good for SQL users. The operators that SQL will be
>>> >     >> translated to are a black box to the user. We also do not guarantee
>>> >     >> the cross-version consistency of the transformation so far.
>>> >     >>
>>> >     >> I think it can be treated as a follow-up work when the fine-grained
>>> >     >> resource management is end-to-end ready.
>>> >     >>
>>> >     >> Best,
>>> >     >> Yangze Guo
>>> >     >>
>>> >     >>
>>> >     >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song
>>> >     <[email protected] <mailto:[email protected]>>
>>> >     >> wrote:
>>> >     >>> Thanks for the feedback, Till.
>>> >     >>>
>>> >     >>> ## I feel that what you proposed (operator-based + default
>>> >     value) might
>>> >     >> be
>>> >     >>> subsumed by the SSG-based approach.
>>> >     >>> Thinking of op_1 -> op_2, there are the following 4 cases,
>>> >     categorized by
>>> >     >>> whether the resource requirements are known to the users.
>>> >     >>>
>>> >     >>>     1. *Both known.* As previously mentioned, there's no
>>> >     reason to put
>>> >     >>>     multiple operators whose individual resource requirements
>>> >     are already
>>> >     >> known
>>> >     >>>     into the same group in fine-grained resource management.
>>> >     And if op_1
>>> >     >> and
>>> >     >>>     op_2 are in different groups, there should be no problem
>>> >     switching
>>> >     >> data
>>> >     >>>     exchange mode from pipelined to blocking. This is
>>> >     equivalent to
>>> >     >> specifying
>>> >     >>>     operator resource requirements in your proposal.
>>> >     >>>     2. *op_1 known, op_2 unknown.* Similar to 1), except that
>>> >     op_2 is in a
>>> >     >>>     SSG whose resource is not specified thus would have the
>>> >     default slot
>>> >     >>>     resource. This is equivalent to having default operator
>>> >     resources in
>>> >     >> your
>>> >     >>>     proposal.
>>> >     >>>     3. *Both unknown*. The user can either set op_1 and op_2
>>> >     to the same
>>> >     >> SSG
>>> >     >>>     or separate SSGs.
>>> >     >>>        - If op_1 and op_2 are in the same SSG, it will be
>>> >     equivalent to
>>> >     >> the
>>> >     >>>        coarse-grained resource management, where op_1 and op_2
>>> >     share a
>>> >     >> default
>>> >     >>>        size slot no matter which data exchange mode is used.
>>> >     >>>        - If op_1 and op_2 are in different SSGs, then each of
>>> >     them will
>>> >     >> use
>>> >     >>>        a default size slot. This is equivalent to setting them
>>> >     with
>>> >     >> default
>>> >     >>>        operator resources in your proposal.
>>> >     >>>     4. *Total (pipeline) or max (blocking) of op_1 and op_2 is
>>> >     known.*
>>> >     >>>        - It is possible that the user learns the total / max
>>> >     resource
>>> >     >>>        requirement from executing and monitoring the job,
>>> >     while not
>>> >     >>> being aware of
>>> >     >>>        individual operator requirements.
>>> >     >>>        - I believe this is the case your proposal does not
>>> >     cover. And TBH,
>>> >     >>>        this is probably how most users learn the resource
>>> >     requirements,
>>> >     >>> according
>>> >     >>>        to my experiences.
>>> >     >>>        - In this case, the user might need to specify
>>> >     different resources
>>> >     >> if
>>> >     >>>        he wants to switch the execution mode, which should not
>>> >     be worse
>>> >     >> than not
>>> >     >>>        being able to use fine-grained resource management.
>>> >     >>>
>>> >     >>>
>>> >     >>> ## An additional idea inspired by your proposal.
>>> >     >>> We may provide multiple options for deciding resources for
>>> >     SSGs whose
>>> >     >>> requirement is not specified, if needed.
>>> >     >>>
>>> >     >>>     - Default slot resource (current design)
>>> >     >>>     - Default operator resource times number of operators
>>> >     (equivalent to
>>> >     >>>     your proposal)
>>> >     >>>
>>> >     >>>
>>> >     >>> ## Exposing internal runtime strategies
>>> >     >>> Theoretically, yes. Tying to the SSGs, the resource
>>> >     requirements might be
>>> >     >>> affected if how SSGs are internally handled changes in future.
>>> >     >> Practically,
>>> >     >>> I do not concretely see at the moment what kind of changes we
>>> >     may want in
>>> >     >>> future that might conflict with this FLIP proposal, as the
>>> >     question of
>>> >     >>> switching data exchange mode answered above. I'd suggest to
>>> >     not give up
>>> >     >> the
>>> >     >>> user friendliness we may gain now for the future problems that
>>> >     may or may
>>> >     >>> not exist.
>>> >     >>>
>>> >     >>> Moreover, the SSG-based approach has the flexibility to
>>> >     achieve the
>>> >     >>> equivalent behavior as the operator-based approach, if we set each
>>> >     >> operator
>>> >     >>> (or task) to a separate SSG. We can even provide a shortcut
>>> >     option to
>>> >     >>> automatically do that for users, if needed.
>>> >     >>>
>>> >     >>>
>>> >     >>> Thank you~
>>> >     >>>
>>> >     >>> Xintong Song
>>> >     >>>
>>> >     >>>
>>> >     >>>
>>> >     >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann
>>> >     <[email protected] <mailto:[email protected]>>
>>> >     >> wrote:
>>> >     >>>> Thanks for the responses Xintong and Stephan,
>>> >     >>>>
>>> >     >>>> I agree that being able to define the resource requirements for a
>>> >     >> group of
>>> >     >>>> operators is more user friendly. However, my concern is that
>>> >     we are
>>> >     >>>> exposing thereby internal runtime strategies which might
>>> >     limit our
>>> >     >>>> flexibility to execute a given job. Moreover, the semantics of
>>> >     >> configuring
>>> >     >>>> resource requirements for SSGs could break if switching from
>>> >     streaming
>>> >     >> to
>>> >     >>>> batch execution. If one defines the resource requirements for
>>> >     op_1 ->
>>> >     >> op_2
>>> >     >>>> which run in pipelined mode when using the streaming
>>> >     execution, then
>>> >     >> how do
>>> >     >>>> we interpret these requirements when op_1 -> op_2 are
>>> >     executed with a
>>> >     >>>> blocking data exchange in batch execution mode? Consequently,
>>> >     I am
>>> >     >> still
>>> >     >>>> leaning towards Stephan's proposal to set the resource
>>> >     requirements per
>>> >     >>>> operator.
>>> >     >>>>
>>> >     >>>> Maybe the following proposal makes the configuration easier:
>>> >     If the
>>> >     >> user
>>> >     >>>> wants to use fine-grained resource requirements, then she
>>> >     needs to
>>> >     >> specify
>>> >     >>>> the default size which is used for operators which have no
>>> >     explicit
>>> >     >>>> resource annotation. If this holds true, then every operator
>>> >     would
>>> >     >> have a
>>> >     >>>> resource requirement and the system can try to execute the
>>> >     operators
>>> >     >> in the
>>> >     >>>> best possible manner w/o being constrained by how the user
>>> >     set the SSG
>>> >     >>>> requirements.
>>> >     >>>>
>>> >     >>>> Cheers,
>>> >     >>>> Till
>>> >     >>>>
>>> >     >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song
>>> >     <[email protected] <mailto:[email protected]>>
>>> >     >>>> wrote:
>>> >     >>>>
>>> >     >>>>> Thanks for the feedback, Stephan.
>>> >     >>>>>
>>> >     >>>>> Actually, your proposal has also come to my mind at some
>>> >     point. And I
>>> >     >>>> have
>>> >     >>>>> some concerns about it.
>>> >     >>>>>
>>> >     >>>>>
>>> >     >>>>> 1. It does not give users the same control as the SSG-based
>>> >     approach.
>>> >     >>>>>
>>> >     >>>>>
>>> >     >>>>> While both approaches do not require specifying for each
>>> >     operator,
>>> >     >>>>> SSG-based approach supports the semantic that "some operators
>>> >     >> together
>>> >     >>>> use
>>> >     >>>>> this much resource" while the operator-based approach doesn't.
>>> >     >>>>>
>>> >     >>>>>
>>> >     >>>>> Think of a long pipeline with m operators (o_1, o_2, ...,
>>> >     o_m), and
>>> >     >> at
>>> >     >>>> some
>>> >     >>>>> point there's an agg o_n (1 < n < m) which significantly
>>> >     reduces the
>>> >     >> data
>>> >     >>>>> amount. One can separate the pipeline into 2 groups SSG_1
>>> >     (o_1, ...,
>>> >     >> o_n)
>>> >     >>>>> and SSG_2 (o_n+1, ... o_m), so that configuring much higher
>>> >     >> parallelisms
>>> >     >>>>> for operators in SSG_1 than for operators in SSG_2 won't
>>> >     lead to too
>>> >     >> much
>>> >     >>>>> wasting of resources. If the two SSGs end up needing different
>>> >     >> resources,
>>> >     >>>>> with the SSG-based approach one can directly specify
>>> >     resources for
>>> >     >> the
>>> >     >>>> two
>>> >     >>>>> groups. However, with the operator-based approach, the user will
>>> >     >> have to
>>> >     >>>>> specify resources for each operator in one of the two
>>> >     groups, and
>>> >     >> tune
>>> >     >>>> the
>>> >     >>>>> default slot resource via configurations to fit the other group.
>>> >     >>>>>
>>> >     >>>>>
>>> >     >>>>> 2. It increases the chance of breaking operator chains.
>>> >     >>>>>
>>> >     >>>>>
>>> >     >>>>> Setting chainable operators into different slot sharing
>>> >     groups will
>>> >     >>>>> prevent them from being chained. In the current implementation,
>>> >     >>>> downstream
>>> >     >>>>> operators, if SSG not explicitly specified, will be set to
>>> >     the same
>>> >     >> group
>>> >     >>>>> as the chainable upstream operators (unless multiple upstream
>>> >     >> operators
>>> >     >>>> in
>>> >     >>>>> different groups), to reduce the chance of breaking chains.
>>> >     >>>>>
>>> >     >>>>>
>>> >     >>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4,
>>> >     deciding
>>> >     >> SSGs
>>> >     >>>>> based on whether resource is specified we will easily get
>>> >     groups like
>>> >     >>>> (o_1,
>>> >     >>>>> o_3) & (o_2, o_4), where none of the operators can be
>>> >     chained. This
>>> >     >> is
>>> >     >>>> also
>>> >     >>>>> possible for the SSG-based approach, but I believe the
>>> >     chance is much
>>> >     >>>>> smaller because there's no strong reason for users to
>>> >     specify the
>>> >     >> groups
>>> >     >>>>> with alternate operators like that. We are more likely to
>>> >     get groups
>>> >     >> like
>>> >     >>>>> (o_1, o_2) & (o_3, o_4), where the chain breaks only between
>>> >     o_2 and
>>> >     >> o_3.
>>> >     >>>>>
>>> >     >>>>> 3. It complicates the system by having two different
>>> >     mechanisms for
>>> >     >>>> sharing
>>> >     >>>>> managed memory in a slot.
>>> >     >>>>>
>>> >     >>>>>
>>> >     >>>>> - In FLIP-141, we introduced the intra-slot managed memory
>>> >     sharing
>>> >     >>>>> mechanism, where managed memory is first distributed
>>> >     according to the
>>> >     >>>>> consumer type, then further distributed across operators of that
>>> >     >> consumer
>>> >     >>>>> type.
>>> >     >>>>>
>>> >     >>>>> - With the operator-based approach, managed memory size
>>> >     specified
>>> >     >> for an
>>> >     >>>>> operator should account for all the consumer types of that
>>> >     operator.
>>> >     >> That
>>> >     >>>>> means the managed memory is first distributed across
>>> >     operators, then
>>> >     >>>>> distributed to different consumer types of each operator.
>>> >     >>>>>
>>> >     >>>>>
>>> >     >>>>> Unfortunately, the different order of the two calculation
>>> >     steps can
>>> >     >> lead
>>> >     >>>> to
>>> >     >>>>> different results. To be specific, the semantic of the
>>> >     configuration
>>> >     >>>> option
>>> >     >>>>> `consumer-weights` changed (within a slot vs. within an
>>> >     operator).
>>> >     >>>>>
>>> >     >>>>>
>>> >     >>>>>
>>> >     >>>>> To sum up things:
>>> >     >>>>>
>>> >     >>>>> While (3) might be a bit more implementation related, I
>>> >     think (1)
>>> >     >> and (2)
>>> >     >>>>> somehow suggest that, the price for the proposed approach to
>>> >     avoid
>>> >     >>>>> specifying resource for every operator is that it's not as
>>> >     >> independent
>>> >     >>>> from
>>> >     >>>>> operator chaining and slot sharing as the operator-based
>>> >     approach
>>> >     >>>> discussed
>>> >     >>>>> in the FLIP.
>>> >     >>>>>
>>> >     >>>>>
>>> >     >>>>> Thank you~
>>> >     >>>>>
>>> >     >>>>> Xintong Song
>>> >     >>>>>
>>> >     >>>>>
>>> >     >>>>>
>>> >     >>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen
>>> >     <[email protected] <mailto:[email protected]>>
>>> >     >> wrote:
>>> >     >>>>>> Thanks a lot, Yangze and Xintong for this FLIP.
>>> >     >>>>>>
>>> >     >>>>>> I want to say, first of all, that this is super well
>>> >     written. And
>>> >     >> the
>>> >     >>>>>> points that the FLIP makes about how to expose the
>>> >     configuration to
>>> >     >>>> users
>>> >     >>>>>> is exactly the right thing to figure out first.
>>> >     >>>>>> So good job here!
>>> >     >>>>>>
>>> >     >>>>>> About how to let users specify the resource profiles. If I
>>> >     can sum
>>> >     >> the
>>> >     >>>>> FLIP
>>> >     >>>>>> and previous discussion up in my own words, the problem is the
>>> >     >>>> following:
>>> >     >>>>>> Operator-level specification is the simplest and cleanest
>>> >     approach,
>>> >     >>>>> because
>>> >     >>>>>>> it avoids mixing operator configuration (resource) and
>>> >     >> scheduling. No
>>> >     >>>>>>> matter what other parameters change (chaining, slot sharing,
>>> >     >>>> switching
>>> >     >>>>>>> pipelined and blocking shuffles), the resource profiles
>>> >     stay the
>>> >     >>>> same.
>>> >     >>>>>>> But it would require that a user specifies resources on all
>>> >     >>>> operators,
>>> >     >>>>>>> which makes it hard to use. That's why the FLIP suggests going
>>> >     >> with
>>> >     >>>>>>> specifying resources on a Sharing-Group.
>>> >     >>>>>>
>>> >     >>>>>> I think both thoughts are important, so can we find a solution
>>> >     >> where
>>> >     >>>> the
>>> >     >>>>>> Resource Profiles are specified on an Operator, but we
>>> >     still avoid
>>> >     >> that
>>> >     >>>>> we
>>> >     >>>>>> need to specify a resource profile on every operator?
>>> >     >>>>>>
>>> >     >>>>>> What do you think about something like the following:
>>> >     >>>>>>    - Resource Profiles are specified on an operator level.
>>> >     >>>>>>    - Not all operators need profiles
>>> >     >>>>>>    - All Operators without a Resource Profile end up in the
>>> >     >> default
>>> >     >>>> slot
>>> >     >>>>>> sharing group with a default profile (will get a default slot).
>>> >     >>>>>>    - All Operators with a Resource Profile will go into
>>> >     another slot
>>> >     >>>>> sharing
>>> >     >>>>>> group (the resource-specified-group).
>>> >     >>>>>>    - Users can define different slot sharing groups for
>>> >     operators
>>> >     >> like
>>> >     >>>>> they
>>> >     >>>>>> do now, with the exception that you cannot mix operators
>>> >     that have
>>> >     >> a
>>> >     >>>>>> resource profile and operators that have no resource profile.
>>> >     >>>>>>    - The default case where no operator has a resource
>>> >     profile is
>>> >     >> just a
>>> >     >>>>>> special case of this model
>>> >     >>>>>>    - The chaining logic sums up the profiles per operator,
>>> >     like it
>>> >     >> does
>>> >     >>>>> now,
>>> >     >>>>>> and the scheduler sums up the profiles of the tasks that it
>>> >     >> schedules
>>> >     >>>>>> together.
>>> >     >>>>>>
>>> >     >>>>>>
>>> >     >>>>>> There is another question about reactive scaling raised in the
>>> >     >> FLIP. I
>>> >     >>>>> need
>>> >     >>>>>> to think a bit about that. That is indeed a bit more tricky
>>> >     once we
>>> >     >>>> have
>>> >     >>>>>> slots of different sizes.
>>> >     >>>>>> It is not clear then which of the different slot requests the
>>> >     >>>>>> ResourceManager should fulfill when new resources (TMs)
>>> >     show up,
>>> >     >> or how
>>> >     >>>>> the
>>> >     >>>>>> JobManager redistributes the slots resources when resources
>>> >     (TMs)
>>> >     >>>>> disappear.
>>> >     >>>>>> This question is pretty orthogonal, though, to the "how to
>>> >     specify
>>> >     >> the
>>> >     >>>>>> resources".
>>> >     >>>>>>
>>> >     >>>>>>
>>> >     >>>>>> Best,
>>> >     >>>>>> Stephan
>>> >     >>>>>>
>>> >     >>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song
>>> >     <[email protected] <mailto:[email protected]>
>>> >     >>>>> wrote:
>>> >     >>>>>>> Thanks for drafting the FLIP and driving the discussion,
>>> >     Yangze.
>>> >     >>>>>>> And Thanks for the feedback, Till and Chesnay.
>>> >     >>>>>>>
>>> >     >>>>>>> @Till,
>>> >     >>>>>>>
>>> >     >>>>>>> I agree that specifying requirements for SSGs means that SSGs
>>> >     >> need to
>>> >     >>>>> be
>>> >     >>>>>>> supported in fine-grained resource management, otherwise each
>>> >     >>>> operator
>>> >     >>>>>>> might use as many resources as the whole group. However, I
>>> >     cannot
>>> >     >>>> think
>>> >     >>>>>> of
>>> >     >>>>>>> a strong reason for not supporting SSGs in fine-grained
>>> >     resource
>>> >     >>>>>>> management.
>>> >     >>>>>>>
>>> >     >>>>>>>
>>> >     >>>>>>>> Interestingly, if all operators have their resources properly
>>> >     >>>>>> specified,
>>> >     >>>>>>>> then slot sharing is no longer needed because Flink could
>>> >     >> slice off
>>> >     >>>>> the
>>> >     >>>>>>>> appropriately sized slots for every Task individually.
>>> >     >>>>>>>>
>>> >     >>>>>>> So for example, if we have a job consisting of two
>>> >     operator op_1
>>> >     >> and
>>> >     >>>>> op_2
>>> >     >>>>>>>> where each op needs 100 MB of memory, we would then say that
>>> >     >> the
>>> >     >>>> slot
>>> >     >>>>>>>> sharing group needs 200 MB of memory to run. If we have a
>>> >     >> cluster
>>> >     >>>>> with
>>> >     >>>>>> 2
>>> >     >>>>>>>> TMs with one slot of 100 MB each, then the system cannot run
>>> >     >> this
>>> >     >>>>> job.
>>> >     >>>>>> If
>>> >     >>>>>>>> the resources were specified on an operator level, then the
>>> >     >> system
>>> >     >>>>>> could
>>> >     >>>>>>>> still make the decision to deploy op_1 to TM_1 and op_2 to
>>> >     >> TM_2.
>>> >     >>>>>>>
>>> >     >>>>>>> Couldn't agree more that if all operators' requirements are
>>> >     >> properly
>>> >     >>>>>>> specified, slot sharing should be no longer needed. I
>>> >     think this
>>> >     >>>>> exactly
>>> >     >>>>>>> disproves the example. If we already know op_1 and op_2 each
>>> >     >> needs
>>> >     >>>> 100
>>> >     >>>>> MB
>>> >     >>>>>>> of memory, why would we put them in the same group? If
>>> >     they are
>>> >     >> in
>>> >     >>>>>> separate
>>> >     >>>>>>> groups, with the proposed approach the system can freely
>>> >     deploy
>>> >     >> them
>>> >     >>>> to
>>> >     >>>>>>> either a 200 MB TM or two 100 MB TMs.
>>> >     >>>>>>>
>>> >     >>>>>>> Moreover, the precondition for not needing slot sharing is
>>> >     having
>>> >     >>>>>> resource
>>> >     >>>>>>> requirements properly specified for all operators. This is not
>>> >     >> always
>>> >     >>>>>>> possible, and usually requires tremendous efforts. One of the
>>> >     >>>> benefits
>>> >     >>>>>> for
>>> >     >>>>>>> SSG-based requirements is that it allows the user to freely
>>> >     >> decide
>>> >     >>>> the
>>> >     >>>>>>> granularity, thus efforts they want to pay. I would
>>> >     consider SSG
>>> >     >> in
>>> >     >>>>>>> fine-grained resource management as a group of operators
>>> >     that the
>>> >     >>>> user
>>> >     >>>>>>> would like to specify the total resource for. There can be
>>> >     only
>>> >     >> one
>>> >     >>>>> group
>>> >     >>>>>>> in the job, 2~3 groups dividing the job into a few major
>>> >     parts,
>>> >     >> or as
>>> >     >>>>>> many
>>> >     >>>>>>> groups as the number of tasks/operators, depending on how
>>> >     >>>> fine-grained
>>> >     >>>>>> the
>>> >     >>>>>>> user is able to specify the resources.
>>> >     >>>>>>>
>>> >     >>>>>>> Having to support SSGs might be a constraint. But given
>>> >     that all
>>> >     >> the
>>> >     >>>>>>> current scheduler implementations already support SSGs, I
>>> >     tend to
>>> >     >>>> think
>>> >     >>>>>>> that as an acceptable price for the above discussed
>>> >     usability and
>>> >     >>>>>>> flexibility.
>>> >     >>>>>>>
>>> >     >>>>>>> @Chesnay
>>> >     >>>>>>>
>>> >     >>>>>>> Will declaring them on slot sharing groups not also waste
>>> >     >> resources
>>> >     >>>> if
>>> >     >>>>>> the
>>> >     >>>>>>>> parallelism of operators within that group are different?
>>> >     >>>>>>>>
>>> >     >>>>>>> Yes. It's a trade-off between usability and resource
>>> >     >> utilization. To
>>> >     >>>>>> avoid
>>> >     >>>>>>> such wasting, the user can define more groups, so that
>>> >     each group
>>> >     >>>>>> contains
>>> >     >>>>>>> fewer operators and the chance of having operators with
>>> >     different
>>> >     >>>>>>> parallelism will be reduced. The price is to have more
>>> >     resource
>>> >     >>>>>>> requirements to specify.
>>> >     >>>>>>>
>>> >     >>>>>>> It also seems like quite a hassle for users having to
>>> >     >> recalculate the
>>> >     >>>>>>>> resource requirements if they change the slot sharing.
>>> >     >>>>>>>> I'd think that it's not really workable for users that create
>>> >     >> a set
>>> >     >>>>> of
>>> >     >>>>>>>> re-usable operators which are mixed and matched in their
>>> >     >>>>> applications;
>>> >     >>>>>>>> managing the resources requirements in such a setting
>>> >     would be
>>> >     >> a
>>> >     >>>>>>>> nightmare, and in the end would require operator-level
>>> >     >> requirements
>>> >     >>>>> any
>>> >     >>>>>>>> way.
>>> >     >>>>>>>> In that sense, I'm not even sure whether it really increases
>>> >     >>>>> usability.
>>> >     >>>>>>>     - As mentioned in my reply to Till's comment, there's no
>>> >     >> reason to
>>> >     >>>>> put
>>> >     >>>>>>>     multiple operators whose individual resource
>>> >     requirements are
>>> >     >>>>> already
>>> >     >>>>>>> known
>>> >     >>>>>>>     into the same group in fine-grained resource management.
>>> >     >>>>>>>     - Even if an operator implementation is reused for multiple
>>> >     >>>>> applications,
>>> >     >>>>>>>     it does not guarantee the same resource requirements.
>>> >     During
>>> >     >> our
>>> >     >>>>> years
>>> >     >>>>>>> of
>>> >     >>>>>>>     practices in Alibaba, with per-operator requirements
>>> >     >> specified for
>>> >     >>>>>>> Blink's
>>> >     >>>>>>>     fine-grained resource management, very few users
>>> >     (including
>>> >     >> our
>>> >     >>>>>>> specialists
>>> >     >>>>>>>     who are dedicated to supporting Blink users) are as
>>> >     >> experienced as
>>> >     >>>>> to
>>> >     >>>>>>>     accurately predict/estimate the operator resource
>>> >     >> requirements.
>>> >     >>>> Most
>>> >     >>>>>>> people
>>> >     >>>>>>>     rely on the execution-time metrics (throughput, delay, cpu
>>> >     >> load,
>>> >     >>>>>> memory
>>> >     >>>>>>>     usage, GC pressure, etc.) to improve the specification.
>>> >     >>>>>>>
>>> >     >>>>>>> To sum up:
>>> >     >>>>>>> If the user is capable of providing proper resource
>>> >     requirements
>>> >     >> for
>>> >     >>>>>> every
>>> >     >>>>>>> operator, that's definitely a good thing and we would not
>>> >     need to
>>> >     >>>> rely
>>> >     >>>>> on
>>> >     >>>>>>> the SSGs. However, that shouldn't be a *must* for the
>>> >     >> fine-grained
>>> >     >>>>>> resource
>>> >     >>>>>>> management to work. For those users who are capable and do not
>>> >     >> like
>>> >     >>>>>> having
>>> >     >>>>>>> to set each operator to a separate SSG, I would be ok to have
>>> >     >> both
>>> >     >>>>>>> SSG-based and operator-based runtime interfaces and to only
>>> >     >> fall back
>>> >     >>>> to
>>> >     >>>>>> the
>>> >     >>>>>>> SSG requirements when the operator requirements are not
>>> >     >> specified.
>>> >     >>>>>> However,
>>> >     >>>>>>> as the first step, I think we should prioritise the use cases
>>> >     >> where
>>> >     >>>>> users
>>> >     >>>>>>> are not that experienced.
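
The fallback described in the summary above can be sketched as follows. This is only an illustration under stated assumptions: `resolve_group_memory_mb` is a hypothetical helper, not part of Flink or of the FLIP, and memory in MB stands in for a full resource profile.

```python
from typing import Optional, Sequence


def resolve_group_memory_mb(
    operator_mb: Sequence[Optional[int]],
    ssg_mb: Optional[int],
) -> Optional[int]:
    """Pick the memory requirement for one slot sharing group.

    Hypothetical rule: if every operator in the group has its own
    requirement, use the sum of those; otherwise fall back to the
    requirement declared on the group, which may itself be None (unknown).
    """
    if operator_mb and all(mb is not None for mb in operator_mb):
        return sum(operator_mb)
    return ssg_mb


# All operators specified: the operator-level requirements win.
fully_specified = resolve_group_memory_mb([100, 150], ssg_mb=400)
# One operator unspecified: fall back to the group-level requirement.
partially_specified = resolve_group_memory_mb([100, None], ssg_mb=400)
```

Treating a partially specified group as "not specified" is one possible choice; the thread does not settle how such a case should be handled.
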
>>> >     >>>>>>>
>>> >     >>>>>>> Thank you~
>>> >     >>>>>>>
>>> >     >>>>>>> Xintong Song
>>> >     >>>>>>>
>>> >     >>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler <
>>> >     >> [email protected] <mailto:[email protected]>>
>>> >     >>>>>>> wrote:
>>> >     >>>>>>>
>>> >     >>>>>>>> Will declaring them on slot sharing groups not also waste
>>> >     >> resources
>>> >     >>>>> if
>>> >     >>>>>>>> the parallelisms of operators within that group are different?
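
A small arithmetic sketch of this concern, assuming (as a simplification, not something stated in the thread) that a group needs as many slots as its highest operator parallelism and that every slot reserves the full group requirement. The function name and the numbers are made up for illustration.

```python
from typing import Dict, Tuple


def reserved_vs_used_mb(
    op_memory_mb: Dict[str, int],
    op_parallelism: Dict[str, int],
) -> Tuple[int, int]:
    """Return (reserved, used) memory in MB for one slot sharing group."""
    # Assumption: one slot per subtask index, up to the highest parallelism.
    slots = max(op_parallelism.values())
    # Assumption: each slot reserves the sum of all operator requirements.
    per_slot_mb = sum(op_memory_mb.values())
    reserved = slots * per_slot_mb
    # Memory the subtasks would actually need if sized individually.
    used = sum(op_memory_mb[name] * op_parallelism[name] for name in op_memory_mb)
    return reserved, used


reserved, used = reserved_vs_used_mb(
    op_memory_mb={"source": 100, "map": 100},
    op_parallelism={"source": 2, "map": 4},
)
wasted = reserved - used  # slots 3 and 4 hold no "source" subtask
```

With these numbers, four slots of 200 MB reserve 800 MB while the subtasks need 600 MB, so 200 MB is reserved without being used.
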
>>> >     >>>>>>>>
>>> >     >>>>>>>> It also seems like quite a hassle for users having to
>>> >     >> recalculate
>>> >     >>>> the
>>> >     >>>>>>>> resource requirements if they change the slot sharing.
>>> >     >>>>>>>> I'd think that it's not really workable for users that create
>>> >     >> a set
>>> >     >>>>> of
>>> >     >>>>>>>> re-usable operators which are mixed and matched in their
>>> >     >>>>> applications;
>>> >     >>>>>>>> managing the resource requirements in such a setting
>>> >     would be
>>> >     >> a
>>> >     >>>>>>>> nightmare, and in the end would require operator-level
>>> >     >> requirements
>>> >     >>>>> any
>>> >     >>>>>>>> way.
>>> >     >>>>>>>> In that sense, I'm not even sure whether it really increases
>>> >     >>>>> usability.
>>> >     >>>>>>>> My main worry is that if we wire the runtime to work
>>> >     on SSGs
>>> >     >>>> it's
>>> >     >>>>>>>> gonna be difficult to implement more fine-grained approaches,
>>> >     >> which
>>> >     >>>>>>>> would not be the case if, for the runtime, they are always
>>> >     >> defined
>>> >     >>>> on
>>> >     >>>>>> an
>>> >     >>>>>>>> operator-level.
>>> >     >>>>>>>>
>>> >     >>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote:
>>> >     >>>>>>>>> Thanks for drafting this FLIP and starting this discussion
>>> >     >>>> Yangze.
>>> >     >>>>>>>>> I like that defining resource requirements on a slot sharing
>>> >     >>>> group
>>> >     >>>>>>> makes
>>> >     >>>>>>>>> the overall setup easier and improves usability of resource
>>> >     >>>>>>> requirements.
>>> >     >>>>>>>>> What I do not like about it is that it changes slot sharing
>>> >     >>>> groups
>>> >     >>>>>> from
>>> >     >>>>>>>>> being a scheduling hint to something which needs to be
>>> >     >> supported
>>> >     >>>> in
>>> >     >>>>>>> order
>>> >     >>>>>>>>> to support fine-grained resource requirements. So far, the
>>> >     >> idea
>>> >     >>>> of
>>> >     >>>>>> slot
>>> >     >>>>>>>>> sharing groups was that it tells the system that a set of
>>> >     >>>> operators
>>> >     >>>>>> can
>>> >     >>>>>>>> be
>>> >     >>>>>>>>> deployed in the same slot. But the system still had the
>>> >     >> freedom
>>> >     >>>> to
>>> >     >>>>>> say
>>> >     >>>>>>>> that
>>> >     >>>>>>>>> it would rather place these tasks in different slots if it
>>> >     >>>> wanted.
>>> >     >>>>> If
>>> >     >>>>>>> we
>>> >     >>>>>>>>> now specify resource requirements on a per slot sharing
>>> >     >> group,
>>> >     >>>> then
>>> >     >>>>>> the
>>> >     >>>>>>>>> only option for a scheduler which does not support slot
>>> >     >> sharing
>>> >     >>>>>> groups
>>> >     >>>>>>> is
>>> >     >>>>>>>>> to say that every operator in this slot sharing group
>>> >     needs a
>>> >     >>>> slot
>>> >     >>>>>> with
>>> >     >>>>>>>> the
>>> >     >>>>>>>>> same resources as the whole group.
>>> >     >>>>>>>>>
>>> >     >>>>>>>>> So for example, if we have a job consisting of two operators
>>> >     >> op_1
>>> >     >>>>> and
>>> >     >>>>>>> op_2
>>> >     >>>>>>>>> where each op needs 100 MB of memory, we would then say that
>>> >     >> the
>>> >     >>>>> slot
>>> >     >>>>>>>>> sharing group needs 200 MB of memory to run. If we have a
>>> >     >> cluster
>>> >     >>>>>> with
>>> >     >>>>>>> 2
>>> >     >>>>>>>>> TMs with one slot of 100 MB each, then the system cannot run
>>> >     >> this
>>> >     >>>>>> job.
>>> >     >>>>>>> If
>>> >     >>>>>>>>> the resources were specified on an operator level, then the
>>> >     >>>> system
>>> >     >>>>>>> could
>>> >     >>>>>>>>> still make the decision to deploy op_1 to TM_1 and op_2 to
>>> >     >> TM_2.
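
The example above can be checked with a toy model. This is a minimal sketch, not Flink's scheduler: the `Operator` class and both placement functions are hypothetical, slots are reduced to a memory size in MB, and the operator-level variant assumes one operator per slot.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Operator:
    name: str
    memory_mb: int


def place_as_group(operators: List[Operator], slots_mb: List[int]) -> Optional[int]:
    """Group-level requirement: one slot must hold the whole group."""
    needed = sum(op.memory_mb for op in operators)
    for index, size in enumerate(slots_mb):
        if size >= needed:
            return index
    return None  # no single slot is large enough


def place_per_operator(
    operators: List[Operator], slots_mb: List[int]
) -> Optional[Dict[str, int]]:
    """Operator-level requirements: each operator may get its own slot."""
    if len(operators) > len(slots_mb):
        return None
    # Pair the largest operator with the largest slot, and so on.
    ops = sorted(operators, key=lambda op: op.memory_mb, reverse=True)
    slots = sorted(enumerate(slots_mb), key=lambda pair: pair[1], reverse=True)
    placement: Dict[str, int] = {}
    for op, (slot_index, size) in zip(ops, slots):
        if size < op.memory_mb:
            return None
        placement[op.name] = slot_index
    return placement


ops = [Operator("op_1", 100), Operator("op_2", 100)]
slots = [100, 100]  # TM_1 and TM_2, one 100 MB slot each

group_slot = place_as_group(ops, slots)          # 200 MB fits in no slot
per_operator = place_per_operator(ops, slots)    # op_1 and op_2 on separate slots
```

Under these assumptions the group-level requirement of 200 MB cannot be placed, while the operator-level requirements map op_1 and op_2 to the two 100 MB slots.
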
>>> >     >>>>>>>>> Originally, one of the primary goals of slot sharing groups
>>> >     >> was
>>> >     >>>> to
>>> >     >>>>>> make
>>> >     >>>>>>>> it
>>> >     >>>>>>>>> easier for the user to reason about how many slots a job
>>> >     >> needs
>>> >     >>>>>>>> independent
>>> >     >>>>>>>>> of the actual number of operators in the job. Interestingly,
>>> >     >> if
>>> >     >>>> all
>>> >     >>>>>>>>> operators have their resources properly specified, then slot
>>> >     >>>>> sharing
>>> >     >>>>>> is
>>> >     >>>>>>>> no
>>> >     >>>>>>>>> longer needed because Flink could slice off the
>>> >     appropriately
>>> >     >>>> sized
>>> >     >>>>>>> slots
>>> >     >>>>>>>>> for every Task individually. What matters is whether the
>>> >     >> whole
>>> >     >>>>>> cluster
>>> >     >>>>>>>> has
>>> >     >>>>>>>>> enough resources to run all tasks or not.
>>> >     >>>>>>>>>
>>> >     >>>>>>>>> Cheers,
>>> >     >>>>>>>>> Till
>>> >     >>>>>>>>>
>>> >     >>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo <
>>> >     >> [email protected] <mailto:[email protected]>>
>>> >     >>>>>> wrote:
>>> >     >>>>>>>>>> Hi, there,
>>> >     >>>>>>>>>>
>>> >     >>>>>>>>>> We would like to start a discussion thread on "FLIP-156:
>>> >     >> Runtime
>>> >     >>>>>>>>>> Interfaces for Fine-Grained Resource Requirements"[1],
>>> >     >> where we
>>> >     >>>>>>>>>> propose Slot Sharing Group (SSG) based runtime interfaces
>>> >     >> for
>>> >     >>>>>>>>>> specifying fine-grained resource requirements.
>>> >     >>>>>>>>>>
>>> >     >>>>>>>>>> In this FLIP:
>>> >     >>>>>>>>>> - Expound the user story of fine-grained resource
>>> >     >> management.
>>> >     >>>>>>>>>> - Propose runtime interfaces for specifying SSG-based
>>> >     >> resource
>>> >     >>>>>>>>>> requirements.
>>> >     >>>>>>>>>> - Discuss the pros and cons of the three potential
>>> >     >> granularities
>>> >     >>>>> for
>>> >     >>>>>>>>>> specifying the resource requirements (op, task and slot
>>> >     >> sharing
>>> >     >>>>>> group)
>>> >     >>>>>>>>>> and explain why we choose the slot sharing group.
>>> >     >>>>>>>>>>
>>> >     >>>>>>>>>> Please find more details in the FLIP wiki document [1].
>>> >     >> Looking
>>> >     >>>>>>>>>> forward to your feedback.
>>> >     >>>>>>>>>>
>>> >     >>>>>>>>>> [1]
>>> >     >>>>>>>>>>
>>> >     >>
>>> >     
>>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
>>> >     
>>> > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements>
>>> >     >>>>>>>>>> Best,
>>> >     >>>>>>>>>> Yangze Guo
>>> >     >>>>>>>>>>
>>> >     >>>>>>>>
>>> >
>>>
