Re: [DISCUSS] FLIP-156: Runtime Interfaces for Fine-Grained Resource Requirements

Yangze Guo Tue, 19 Jan 2021 22:04:29 -0800

Thanks for the responses, Till and Xintong.

I second Xintong's comment that the SSG-based runtime interface will
give us the flexibility to achieve the op/task-based approach. That's
one of the most important reasons for our design choice.
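
To make that flexibility concrete, here is a minimal Python sketch. It is
not the Flink API: `Resource`, `one_group_per_operator` and the "ssg-"
naming are hypothetical, used only to show that per-operator requirements
can be expressed as SSG requirements by giving each operator its own group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical resource spec; the fields are illustrative only."""
    cpu: float
    memory_mb: int


def one_group_per_operator(operator_resources):
    """Map per-operator requirements to per-SSG requirements.

    Each operator is placed in its own (hypothetically named) slot
    sharing group, so the group requirement equals the operator's.
    """
    return {f"ssg-{op}": res for op, res in operator_resources.items()}


per_operator = {
    "op_1": Resource(cpu=1.0, memory_mb=100),
    "op_2": Resource(cpu=1.0, memory_mb=100),
}
per_group = one_group_per_operator(per_operator)
```

With one group per operator, the scheduler sees exactly the requirements
an operator-based interface would have provided.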


My two cents regarding the default operator resource:
- It might be good for the scenario of DataStream jobs.
   ** For light-weight operators, the accumulated configuration error
will not be significant. The resource a task uses is then
proportional to the number of operators it contains.
   ** For heavy operators like join and window, or operators using
external resources, users will turn to the fine-grained resource
configuration.
- It can increase the stability of a standalone cluster where the
registered task executors are heterogeneous (with different default
slot resources).
- It might not be good for SQL users. The operators that a SQL query is
translated into are a black box to the user. We also do not guarantee
cross-version consistency of the translation so far.
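
As a rough illustration of the two fallback options for a group whose
resource is not specified (default slot resource vs. default operator
resource times the number of operators), here is a small Python sketch.
All names (`Resource`, `fallback_resource`) and the numbers are
hypothetical and not part of Flink.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical resource spec; the fields are illustrative only."""
    cpu: float
    memory_mb: int

    def times(self, n):
        """Scale the resource linearly by the number of operators."""
        return Resource(self.cpu * n, self.memory_mb * n)


def fallback_resource(num_operators, default_slot, default_operator=None):
    """Resource for a group without an explicit requirement.

    Without a default operator resource, the group gets the default slot
    resource regardless of its size. With one, the group resource grows
    proportionally to the number of operators it contains.
    """
    if default_operator is None:
        return default_slot
    return default_operator.times(num_operators)


default_slot = Resource(cpu=1.0, memory_mb=1024)
default_operator = Resource(cpu=0.5, memory_mb=128)

by_slot = fallback_resource(3, default_slot)
by_operator = fallback_resource(3, default_slot, default_operator)
```

Here `by_slot` stays at the default slot size, while `by_operator` is
three times the default operator resource.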

I think it can be treated as follow-up work once fine-grained
resource management is end-to-end ready.

Best,
Yangze Guo


On Wed, Jan 20, 2021 at 11:16 AM Xintong Song <tonysong...@gmail.com> wrote:
>
> Thanks for the feedback, Till.
>
> ## I feel that what you proposed (operator-based + default value) might be
> subsumed by the SSG-based approach.
> Thinking of op_1 -> op_2, there are the following 4 cases, categorized by
> whether the resource requirements are known to the users.
>
>    1. *Both known.* As previously mentioned, there's no reason to put
>    multiple operators whose individual resource requirements are already known
>    into the same group in fine-grained resource management. And if op_1 and
>    op_2 are in different groups, there should be no problem switching data
>    exchange mode from pipelined to blocking. This is equivalent to specifying
>    operator resource requirements in your proposal.
>    2. *op_1 known, op_2 unknown.* Similar to 1), except that op_2 is in an
>    SSG whose resource is not specified and thus would have the default slot
>    resource. This is equivalent to having default operator resources in your
>    proposal.
>    3. *Both unknown*. The user can either set op_1 and op_2 to the same SSG
>    or separate SSGs.
>       - If op_1 and op_2 are in the same SSG, it will be equivalent to the
>       coarse-grained resource management, where op_1 and op_2 share a default
>       size slot no matter which data exchange mode is used.
>       - If op_1 and op_2 are in different SSGs, then each of them will use
>       a default size slot. This is equivalent to setting them with default
>       operator resources in your proposal.
>    4. *Total (pipeline) or max (blocking) of op_1 and op_2 is known.*
>       - It is possible that the user learns the total / max resource
>       requirement from executing and monitoring the job, while not being
>       aware of individual operator requirements.
>       - I believe this is the case your proposal does not cover. And TBH,
>       this is probably how most users learn the resource requirements, in
>       my experience.
>       - In this case, the user might need to specify different resources if
>       he wants to switch the execution mode, which should not be worse than
>       not being able to use fine-grained resource management.
>
>
> ## An additional idea inspired by your proposal.
> We may provide multiple options for deciding resources for SSGs whose
> requirement is not specified, if needed.
>
>    - Default slot resource (current design)
>    - Default operator resource times number of operators (equivalent to
>    your proposal)
>
>
> ## Exposing internal runtime strategies
> Theoretically, yes. Since the resource requirements are tied to SSGs, they
> might be affected if the way SSGs are handled internally changes in the
> future. Practically, I do not concretely see at the moment what kind of
> changes we may want in the future that would conflict with this FLIP
> proposal, given that the question of switching data exchange mode is
> answered above. I'd suggest not giving up the user friendliness we may gain
> now for future problems that may or may not exist.
>
> Moreover, the SSG-based approach has the flexibility to achieve behavior
> equivalent to the operator-based approach, if we set each operator (or
> task) to a separate SSG. We can even provide a shortcut option to
> automatically do that for users, if needed.
>
>
> Thank you~
>
> Xintong Song
>
>
>
> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
> > Thanks for the responses Xintong and Stephan,
> >
> > I agree that being able to define the resource requirements for a group of
> > operators is more user friendly. However, my concern is that we are
> > thereby exposing internal runtime strategies which might limit our
> > flexibility to execute a given job. Moreover, the semantics of configuring
> > resource requirements for SSGs could break if switching from streaming to
> > batch execution. If one defines the resource requirements for op_1 -> op_2
> > which run in pipelined mode when using the streaming execution, then how do
> > we interpret these requirements when op_1 -> op_2 are executed with a
> > blocking data exchange in batch execution mode? Consequently, I am still
> > leaning towards Stephan's proposal to set the resource requirements per
> > operator.
> >
> > Maybe the following proposal makes the configuration easier: If the user
> > wants to use fine-grained resource requirements, then she needs to specify
> > the default size which is used for operators which have no explicit
> > resource annotation. If this holds true, then every operator would have a
> > resource requirement and the system can try to execute the operators in the
> > best possible manner w/o being constrained by how the user set the SSG
> > requirements.
> >
> > Cheers,
> > Till
> >
> > On Tue, Jan 19, 2021 at 9:09 AM Xintong Song <tonysong...@gmail.com>
> > wrote:
> >
> > > Thanks for the feedback, Stephan.
> > >
> > > Actually, your proposal has also come to my mind at some point. And I
> > > have some concerns about it.
> > >
> > >
> > > 1. It does not give users the same control as the SSG-based approach.
> > >
> > >
> > > While neither approach requires specifying resources for each operator,
> > > the SSG-based approach supports the semantics that "some operators
> > > together use this much resource" while the operator-based approach
> > > doesn't.
> > >
> > >
> > > Think of a long pipeline with m operators (o_1, o_2, ..., o_m), and at
> > > some point there's an agg o_n (1 < n < m) which significantly reduces
> > > the data amount. One can separate the pipeline into 2 groups SSG_1
> > > (o_1, ..., o_n) and SSG_2 (o_n+1, ..., o_m), so that configuring much
> > > higher parallelisms for operators in SSG_1 than for operators in SSG_2
> > > won't lead to too much waste of resources. If the two SSGs end up
> > > needing different resources, with the SSG-based approach one can
> > > directly specify resources for the two groups. However, with the
> > > operator-based approach, the user will have to specify resources for
> > > each operator in one of the two groups, and tune the default slot
> > > resource via configurations to fit the other group.
> > >
> > >
> > > 2. It increases the chance of breaking operator chains.
> > >
> > >
> > > Setting chainable operators into different slot sharing groups will
> > > prevent them from being chained. In the current implementation,
> > > downstream operators, if their SSG is not explicitly specified, will be
> > > set to the same group as the chainable upstream operators (unless there
> > > are multiple upstream operators in different groups), to reduce the
> > > chance of breaking chains.
> > >
> > >
> > > Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4, if we decide
> > > SSGs based on whether resources are specified, we will easily get groups
> > > like (o_1, o_3) & (o_2, o_4), where none of the operators can be
> > > chained. This is also possible for the SSG-based approach, but I believe
> > > the chance is much smaller because there's no strong reason for users to
> > > specify the groups with alternate operators like that. We are more
> > > likely to get groups like (o_1, o_2) & (o_3, o_4), where the chain
> > > breaks only between o_2 and o_3.
> > >
> > >
> > > 3. It complicates the system by having two different mechanisms for
> > > sharing managed memory in a slot.
> > >
> > >
> > > - In FLIP-141, we introduced the intra-slot managed memory sharing
> > > mechanism, where managed memory is first distributed according to the
> > > consumer type, then further distributed across operators of that consumer
> > > type.
> > >
> > > - With the operator-based approach, managed memory size specified for an
> > > operator should account for all the consumer types of that operator. That
> > > means the managed memory is first distributed across operators, then
> > > distributed to different consumer types of each operator.
> > >
> > >
> > > Unfortunately, the different order of the two calculation steps can lead
> > > to different results. To be specific, the semantics of the configuration
> > > option `consumer-weights` would change (within a slot vs. within an
> > > operator).
> > >
> > >
> > >
> > > To sum up things:
> > >
> > > While (3) might be a bit more implementation related, I think (1) and (2)
> > > somehow suggest that the price for the proposed approach to avoid
> > > specifying resource for every operator is that it's not as independent
> > > from operator chaining and slot sharing as the operator-based approach
> > > discussed in the FLIP.
> > >
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <se...@apache.org> wrote:
> > >
> > > > Thanks a lot, Yangze and Xintong for this FLIP.
> > > >
> > > > I want to say, first of all, that this is super well written. And the
> > > > points that the FLIP makes about how to expose the configuration to
> > > > users are exactly the right things to figure out first.
> > > > So good job here!
> > > >
> > > > About how to let users specify the resource profiles. If I can sum the
> > > > FLIP and previous discussion up in my own words, the problem is the
> > > > following:
> > > >
> > > > > Operator-level specification is the simplest and cleanest approach,
> > > > > because it avoids mixing operator configuration (resource) and
> > > > > scheduling. No matter what other parameters change (chaining, slot
> > > > > sharing, switching pipelined and blocking shuffles), the resource
> > > > > profiles stay the same.
> > > > > But it would require that a user specifies resources on all
> > > > > operators, which makes it hard to use. That's why the FLIP suggests
> > > > > going with specifying resources on a Sharing-Group.
> > > >
> > > >
> > > > I think both thoughts are important, so can we find a solution where
> > > > the Resource Profiles are specified on an Operator, but we still avoid
> > > > needing to specify a resource profile on every operator?
> > > >
> > > > What do you think about something like the following:
> > > >   - Resource Profiles are specified on an operator level.
> > > >   - Not all operators need profiles.
> > > >   - All Operators without a Resource Profile end up in the default
> > > > slot sharing group with a default profile (will get a default slot).
> > > >   - All Operators with a Resource Profile will go into another slot
> > > > sharing group (the resource-specified-group).
> > > >   - Users can define different slot sharing groups for operators like
> > > > they do now, with the exception that you cannot mix operators that
> > > > have a resource profile and operators that have no resource profile.
> > > >   - The default case where no operator has a resource profile is just a
> > > > special case of this model.
> > > >   - The chaining logic sums up the profiles per operator, like it does
> > > > now, and the scheduler sums up the profiles of the tasks that it
> > > > schedules together.
> > > >
> > > >
> > > > There is another question about reactive scaling raised in the FLIP. I
> > > > need to think a bit about that. That is indeed a bit more tricky once
> > > > we have slots of different sizes.
> > > > It is not clear then which of the different slot requests the
> > > > ResourceManager should fulfill when new resources (TMs) show up, or how
> > > > the JobManager redistributes the slot resources when resources (TMs)
> > > > disappear.
> > > > This question is pretty orthogonal, though, to the "how to specify the
> > > > resources".
> > > >
> > > >
> > > > Best,
> > > > Stephan
> > > >
> > > > On Fri, Jan 8, 2021 at 5:14 AM Xintong Song <tonysong...@gmail.com>
> > > wrote:
> > > >
> > > > > Thanks for drafting the FLIP and driving the discussion, Yangze.
> > > > > And Thanks for the feedback, Till and Chesnay.
> > > > >
> > > > > @Till,
> > > > >
> > > > > I agree that specifying requirements for SSGs means that SSGs need to
> > > > > be supported in fine-grained resource management, otherwise each
> > > > > operator might use as many resources as the whole group. However, I
> > > > > cannot think of a strong reason for not supporting SSGs in
> > > > > fine-grained resource management.
> > > > >
> > > > >
> > > > > > Interestingly, if all operators have their resources properly
> > > > > > specified, then slot sharing is no longer needed because Flink
> > > > > > could slice off the appropriately sized slots for every Task
> > > > > > individually.
> > > > > >
> > > > > > So for example, if we have a job consisting of two operators op_1
> > > > > > and op_2 where each op needs 100 MB of memory, we would then say
> > > > > > that the slot sharing group needs 200 MB of memory to run. If we
> > > > > > have a cluster with 2 TMs with one slot of 100 MB each, then the
> > > > > > system cannot run this job. If the resources were specified on an
> > > > > > operator level, then the system could still make the decision to
> > > > > > deploy op_1 to TM_1 and op_2 to TM_2.
> > > > >
> > > > >
> > > > > Couldn't agree more that if all operators' requirements are properly
> > > > > specified, slot sharing should no longer be needed. I think this
> > > > > exactly disproves the example. If we already know op_1 and op_2 each
> > > > > needs 100 MB of memory, why would we put them in the same group? If
> > > > > they are in separate groups, with the proposed approach the system
> > > > > can freely deploy them to either a 200 MB TM or two 100 MB TMs.
> > > > >
> > > > > Moreover, the precondition for not needing slot sharing is having
> > > > > resource requirements properly specified for all operators. This is
> > > > > not always possible, and usually requires tremendous effort. One of
> > > > > the benefits of SSG-based requirements is that it allows the user to
> > > > > freely decide the granularity, and thus the effort they want to
> > > > > invest. I would consider an SSG in fine-grained resource management
> > > > > as a group of operators that the user would like to specify the
> > > > > total resource for. There can be only one group in the job, 2~3
> > > > > groups dividing the job into a few major parts, or as many groups as
> > > > > the number of tasks/operators, depending on how fine-grained the
> > > > > user is able to specify the resources.
> > > > >
> > > > > Having to support SSGs might be a constraint. But given that all the
> > > > > current scheduler implementations already support SSGs, I tend to
> > > > > think of that as an acceptable price for the above discussed
> > > > > usability and flexibility.
> > > > >
> > > > > @Chesnay
> > > > >
> > > > > > Will declaring them on slot sharing groups not also waste
> > > > > > resources if the parallelism of operators within that group are
> > > > > > different?
> > > > > >
> > > > > Yes. It's a trade-off between usability and resource utilization. To
> > > > > avoid such waste, the user can define more groups, so that each
> > > > > group contains fewer operators and the chance of having operators
> > > > > with different parallelism will be reduced. The price is to have
> > > > > more resource requirements to specify.
> > > > >
> > > > > > It also seems like quite a hassle for users having to recalculate
> > > > > > the resource requirements if they change the slot sharing.
> > > > > > I'd think that it's not really workable for users that create a
> > > > > > set of re-usable operators which are mixed and matched in their
> > > > > > applications; managing the resources requirements in such a
> > > > > > setting would be a nightmare, and in the end would require
> > > > > > operator-level requirements any way.
> > > > > > In that sense, I'm not even sure whether it really increases
> > > > > > usability.
> > > > > >
> > > > >
> > > > >    - As mentioned in my reply to Till's comment, there's no reason to
> > > > >    put multiple operators whose individual resource requirements are
> > > > >    already known into the same group in fine-grained resource
> > > > >    management.
> > > > >    - Even if an operator implementation is reused for multiple
> > > > >    applications, it does not guarantee the same resource
> > > > >    requirements. During our years of practice at Alibaba, with
> > > > >    per-operator requirements specified for Blink's fine-grained
> > > > >    resource management, very few users (including our specialists
> > > > >    who are dedicated to supporting Blink users) are experienced
> > > > >    enough to accurately predict/estimate the operator resource
> > > > >    requirements. Most people rely on the execution-time metrics
> > > > >    (throughput, delay, cpu load, memory usage, GC pressure, etc.) to
> > > > >    improve the specification.
> > > > >
> > > > > To sum up:
> > > > > If the user is capable of providing proper resource requirements for
> > > > > every operator, that's definitely a good thing and we would not need
> > > > > to rely on the SSGs. However, that shouldn't be a *must* for the
> > > > > fine-grained resource management to work. For those users who are
> > > > > capable and do not like having to set each operator to a separate
> > > > > SSG, I would be ok to have both SSG-based and operator-based runtime
> > > > > interfaces and to only fall back to the SSG requirements when the
> > > > > operator requirements are not specified. However, as the first step,
> > > > > I think we should prioritise the use cases where users are not that
> > > > > experienced.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > > On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler <ches...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Will declaring them on slot sharing groups not also waste
> > > > > > resources if the parallelism of operators within that group are
> > > > > > different?
> > > > > >
> > > > > > It also seems like quite a hassle for users having to recalculate
> > > > > > the resource requirements if they change the slot sharing.
> > > > > > I'd think that it's not really workable for users that create a
> > > > > > set of re-usable operators which are mixed and matched in their
> > > > > > applications; managing the resources requirements in such a
> > > > > > setting would be a nightmare, and in the end would require
> > > > > > operator-level requirements any way.
> > > > > > In that sense, I'm not even sure whether it really increases
> > > > > > usability.
> > > > > >
> > > > > > My main worry is that if we wire the runtime to work on SSGs it's
> > > > > > gonna be difficult to implement more fine-grained approaches,
> > > > > > which would not be the case if, for the runtime, they are always
> > > > > > defined on an operator-level.
> > > > > >
> > > > > > On 1/7/2021 2:42 PM, Till Rohrmann wrote:
> > > > > > > Thanks for drafting this FLIP and starting this discussion
> > > > > > > Yangze.
> > > > > > >
> > > > > > > I like that defining resource requirements on a slot sharing
> > > > > > > group makes the overall setup easier and improves usability of
> > > > > > > resource requirements.
> > > > > > >
> > > > > > > What I do not like about it is that it changes slot sharing
> > > > > > > groups from being a scheduling hint to something which needs to
> > > > > > > be supported in order to support fine grained resource
> > > > > > > requirements. So far, the idea of slot sharing groups was that
> > > > > > > it tells the system that a set of operators can be deployed in
> > > > > > > the same slot. But the system still had the freedom to say that
> > > > > > > it would rather place these tasks in different slots if it
> > > > > > > wanted. If we now specify resource requirements on a per slot
> > > > > > > sharing group, then the only option for a scheduler which does
> > > > > > > not support slot sharing groups is to say that every operator
> > > > > > > in this slot sharing group needs a slot with the same resources
> > > > > > > as the whole group.
> > > > > > >
> > > > > > > So for example, if we have a job consisting of two operators
> > > > > > > op_1 and op_2 where each op needs 100 MB of memory, we would
> > > > > > > then say that the slot sharing group needs 200 MB of memory to
> > > > > > > run. If we have a cluster with 2 TMs with one slot of 100 MB
> > > > > > > each, then the system cannot run this job. If the resources
> > > > > > > were specified on an operator level, then the system could
> > > > > > > still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
> > > > > > >
> > > > > > > Originally, one of the primary goals of slot sharing groups was
> > > > > > > to make it easier for the user to reason about how many slots a
> > > > > > > job needs independent of the actual number of operators in the
> > > > > > > job. Interestingly, if all operators have their resources
> > > > > > > properly specified, then slot sharing is no longer needed
> > > > > > > because Flink could slice off the appropriately sized slots for
> > > > > > > every Task individually. What matters is whether the whole
> > > > > > > cluster has enough resources to run all tasks or not.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo <karma...@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > >> Hi, there,
> > > > > > >>
> > > > > > >> We would like to start a discussion thread on "FLIP-156: Runtime
> > > > > > >> Interfaces for Fine-Grained Resource Requirements"[1], where we
> > > > > > >> propose Slot Sharing Group (SSG) based runtime interfaces for
> > > > > > >> specifying fine-grained resource requirements.
> > > > > > >>
> > > > > > >> In this FLIP:
> > > > > > >> - Expound the user story of fine-grained resource management.
> > > > > > >> - Propose runtime interfaces for specifying SSG-based resource
> > > > > > >> requirements.
> > > > > > > >> - Discuss the pros and cons of the three potential granularities
> > > > > > > >> for specifying the resource requirements (op, task and slot
> > > > > > > >> sharing group) and explain why we choose the slot sharing group.
> > > > > > >>
> > > > > > >> Please find more details in the FLIP wiki document [1]. Looking
> > > > > > >> forward to your feedback.
> > > > > > >>
> > > > > > >> [1]
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Yangze Guo
> > > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
