Is there even a functional difference between specifying the requirements for an SSG vs specifying the same requirements on a single operator within that group (ideally a colocation group to avoid this whole hint business)?

Wouldn't we get the best of both worlds in the latter case?

Users can take shortcuts to define shared requirements, but refine them further as needed on a per-operator basis, without changing the semantics of slot sharing groups and without locking the runtime into SSG-based requirements.

(And before anyone argues what happens if slot sharing groups change or whatnot, that's a plain API issue that we could surely solve; a simple iteration over the slot sharing groups and the operators they contain would suffice.)

On 1/20/2021 6:48 PM, Till Rohrmann wrote:
Maybe a different minor idea: Would it be possible to treat the SSG
resource requirements as a hint for the runtime similar to how slot sharing
groups are designed at the moment? Meaning that we don't give the guarantee
that Flink will always deploy this set of tasks together no matter what
comes. If, for example, the runtime can derive by some means the resource
requirements for each task based on the requirements for the SSG, this
could be possible. One easy strategy would be to give every task the same
resources as the whole slot sharing group. Another one could be
distributing the resources equally among the tasks. This does not even have
to be implemented but we would give ourselves the freedom to change
scheduling if need should arise.
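
To make those two strategies concrete, here is a minimal, self-contained sketch; TaskResources is a made-up value class for illustration only, not Flink's actual ResourceProfile/ResourceSpec API.

// Illustrative only: two ways the runtime could derive per-task requirements
// from an SSG-level requirement.
final class TaskResources {
    final double cpuCores;
    final long taskHeapBytes;

    TaskResources(double cpuCores, long taskHeapBytes) {
        this.cpuCores = cpuCores;
        this.taskHeapBytes = taskHeapBytes;
    }

    // Strategy 1: every task gets the same resources as the whole slot sharing group.
    static TaskResources sameAsGroup(TaskResources group) {
        return group;
    }

    // Strategy 2: distribute the group's resources equally among its tasks.
    static TaskResources equalShare(TaskResources group, int numTasksInGroup) {
        return new TaskResources(
                group.cpuCores / numTasksInGroup,
                group.taskHeapBytes / numTasksInGroup);
    }
}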

Cheers,
Till

On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo <karma...@gmail.com> wrote:

Thanks for the responses, Till and Xintong.

I second Xintong's comment that the SSG-based runtime interface will give
us the flexibility to achieve an op/task-based approach. That's one of
the most important reasons for our design choice.

A few cents regarding the default operator resource:
- It might be good for the scenario of DataStream jobs.
    ** For lightweight operators, the accumulated configuration error will not be significant, and the resources a task uses will be roughly proportional to the number of operators it contains.
    ** For heavy operators like join and window, or operators using external resources, users will turn to the fine-grained resource configuration.
- It can increase stability for standalone clusters where the registered task executors are heterogeneous (with different default slot resources).
- It might not be good for SQL users. The operators that SQL is translated into are a black box to the user, and so far we also do not guarantee cross-version consistency of that translation.

I think it can be treated as follow-up work once fine-grained resource management is ready end-to-end.

Best,
Yangze Guo


On Wed, Jan 20, 2021 at 11:16 AM Xintong Song <tonysong...@gmail.com> wrote:
Thanks for the feedback, Till.

## I feel that what you proposed (operator-based + default value) might be subsumed by the SSG-based approach.
Thinking of op_1 -> op_2, there are the following 4 cases, categorized by whether the resource requirements are known to the users.

    1. *Both known.* As previously mentioned, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management. And if op_1 and op_2 are in different groups, there should be no problem switching the data exchange mode from pipelined to blocking. This is equivalent to specifying operator resource requirements in your proposal.
    2. *op_1 known, op_2 unknown.* Similar to 1), except that op_2 is in an SSG whose resource is not specified and thus would get the default slot resource. This is equivalent to having default operator resources in your proposal.
    3. *Both unknown.* The user can either set op_1 and op_2 to the same SSG or to separate SSGs.
       - If op_1 and op_2 are in the same SSG, it will be equivalent to coarse-grained resource management, where op_1 and op_2 share a default-size slot no matter which data exchange mode is used.
       - If op_1 and op_2 are in different SSGs, then each of them will use a default-size slot. This is equivalent to setting them with default operator resources in your proposal.
    4. *Total (pipelined) or max (blocking) of op_1 and op_2 is known.*
       - It is possible that the user learns the total / max resource requirement from executing and monitoring the job, while not being aware of the individual operator requirements.
       - I believe this is the case your proposal does not cover. And TBH, this is probably how most users learn the resource requirements, according to my experience.
       - In this case, the user might need to specify different resources when switching the execution mode, which should not be worse than not being able to use fine-grained resource management at all.


## An additional idea inspired by your proposal.
We may provide multiple options for deciding resources for SSGs whose requirement is not specified, if needed.

    - Default slot resource (current design)
    - Default operator resource times number of operators (equivalent to your proposal)


## Exposing internal runtime strategies
Theoretically, yes. Being tied to SSGs, the resource requirements might be affected if how SSGs are internally handled changes in the future. Practically, I do not concretely see at the moment what kind of future changes might conflict with this FLIP proposal, as shown by the question of switching the data exchange mode answered above. I'd suggest not giving up the user friendliness we can gain now for future problems that may or may not exist.

Moreover, the SSG-based approach has the flexibility to achieve behavior equivalent to the operator-based approach, if we set each operator (or task) to a separate SSG. We can even provide a shortcut option that does this automatically for users, if needed.


Thank you~

Xintong Song



On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann <trohrm...@apache.org> wrote:
Thanks for the responses Xintong and Stephan,

I agree that being able to define the resource requirements for a group of operators is more user friendly. However, my concern is that we would thereby expose internal runtime strategies which might limit our flexibility to execute a given job. Moreover, the semantics of configuring resource requirements for SSGs could break when switching from streaming to batch execution. If one defines the resource requirements for op_1 -> op_2, which run in pipelined mode when using streaming execution, then how do we interpret these requirements when op_1 -> op_2 are executed with a blocking data exchange in batch execution mode? Consequently, I am still leaning towards Stephan's proposal to set the resource requirements per operator.

Maybe the following proposal makes the configuration easier: If the user wants to use fine-grained resource requirements, then she needs to specify the default size to be used for operators which have no explicit resource annotation. If this holds true, then every operator would have a resource requirement and the system could try to execute the operators in the best possible manner w/o being constrained by how the user set the SSG requirements.
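
A minimal sketch of that resolution logic, using made-up types (not any existing Flink API): operators without an explicit annotation fall back to the user-defined default, so every operator ends up with a concrete requirement that the scheduler is free to place anywhere.

import java.util.List;
import java.util.Optional;

// Illustrative only (made-up types): explicit annotation wins, otherwise the default applies.
final class OperatorResourceDefaults {
    record Resources(double cpuCores, long taskHeapBytes) {}
    record Operator(String name, Optional<Resources> explicitResources) {}

    static Resources resolve(Operator op, Resources defaultResources) {
        return op.explicitResources().orElse(defaultResources);
    }

    public static void main(String[] args) {
        Resources defaults = new Resources(0.5, 128L * 1024 * 1024);
        List<Operator> ops = List.of(
                new Operator("source", Optional.empty()),      // no annotation -> default
                new Operator("heavy-join", Optional.of(new Resources(2.0, 1024L * 1024 * 1024))),
                new Operator("sink", Optional.empty()));       // no annotation -> default

        ops.forEach(op -> System.out.println(op.name() + " -> " + resolve(op, defaults)));
    }
}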

Cheers,
Till

On Tue, Jan 19, 2021 at 9:09 AM Xintong Song <tonysong...@gmail.com> wrote:

Thanks for the feedback, Stephan.

Actually, your proposal had also come to my mind at some point, and I have some concerns about it.


1. It does not give users the same control as the SSG-based approach.


While neither approach requires specifying resources for each operator, the SSG-based approach supports the semantic that "some operators together use this much resource", while the operator-based approach doesn't.


Think of a long pipeline with m operators (o_1, o_2, ..., o_m), and at some point there's an agg o_n (1 < n < m) which significantly reduces the data amount. One can separate the pipeline into 2 groups, SSG_1 (o_1, ..., o_n) and SSG_2 (o_n+1, ..., o_m), so that configuring much higher parallelisms for operators in SSG_1 than for operators in SSG_2 won't lead to too much waste of resources. If the two SSGs end up needing different resources, with the SSG-based approach one can directly specify resources for the two groups. However, with the operator-based approach, the user would have to specify resources for each operator in one of the two groups, and tune the default slot resource via configurations to fit the other group.
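
For illustration, declaring total requirements for the two groups in this example could look roughly like the following; the types here are simplified stand-ins, not the FLIP's actual runtime interfaces, and the numbers are arbitrary.

import java.util.List;
import java.util.Map;

// Illustrative only (made-up types): one total requirement per slot sharing group,
// matching the SSG_1 / SSG_2 split described above.
final class SsgRequirementsExample {
    record GroupRequirement(double cpuCores, int taskHeapMb, List<String> operators) {}

    public static void main(String[] args) {
        Map<String, GroupRequirement> requirements = Map.of(
                // o_1 ... o_n: high-parallelism part before the agg, needs most of the resources
                "SSG_1", new GroupRequirement(4.0, 4096, List.of("o_1", "o_2", "o_n")),
                // o_n+1 ... o_m: low-parallelism part after the agg, much lighter
                "SSG_2", new GroupRequirement(1.0, 512, List.of("o_n+1", "o_m")));

        requirements.forEach((group, req) ->
                System.out.printf("%s -> %.1f cores, %d MB task heap, operators %s%n",
                        group, req.cpuCores(), req.taskHeapMb(), req.operators()));
    }
}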


2. It increases the chance of breaking operator chains.


Setting chainable operators into different slot sharing groups will prevent them from being chained. In the current implementation, downstream operators, if their SSG is not explicitly specified, will be set to the same group as their chainable upstream operators (unless there are multiple upstream operators in different groups), to reduce the chance of breaking chains.


Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4, if we decide SSGs based on whether resources are specified, we could easily get groups like (o_1, o_3) & (o_2, o_4), where none of the operators can be chained. This is also possible for the SSG-based approach, but I believe the chance is much smaller because there's no strong reason for users to specify groups with alternating operators like that. We are more likely to get groups like (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 and o_3.

3. It complicates the system by having two different mechanisms for sharing managed memory in a slot.


- In FLIP-141, we introduced the intra-slot managed memory sharing mechanism, where managed memory is first distributed according to the consumer type, then further distributed across operators of that consumer type.

- With the operator-based approach, managed memory size specified for an operator should account for all the consumer types of that operator. That means the managed memory is first distributed across operators, then distributed to different consumer types of each operator.


Unfortunately, the different order of the two calculation steps can lead to different results. To be specific, the semantics of the configuration option `consumer-weights` would change (within a slot vs. within an operator).
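
For a made-up illustration of why the order matters (assuming, purely for this example, that a consumer type's share is split evenly across the operators using it): take weights OPERATOR:2, PYTHON:1, a slot with 90 MB of managed memory, op_A using both consumer types and op_B using only OPERATOR. Splitting by type first gives OPERATOR 60 MB and PYTHON 30 MB, so op_A ends up with 30 + 30 = 60 MB and op_B with 30 MB. Splitting by operator first (45 MB each) gives op_A 30 + 15 MB and op_B 45 MB, i.e. OPERATOR 75 MB and PYTHON 15 MB in total, which is a different outcome from the same weights.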



To sum up things:

While (3) might be a bit more implementation related, I think (1) and (2) somehow suggest that the price for the proposed approach to avoid specifying resources for every operator is that it's not as independent from operator chaining and slot sharing as the operator-based approach discussed in the FLIP.


Thank you~

Xintong Song



On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <se...@apache.org> wrote:
Thanks a lot, Yangze and Xintong for this FLIP.

I want to say, first of all, that this is super well written. And the points that the FLIP makes about how to expose the configuration to users are exactly the right thing to figure out first.
So good job here!

About how to let users specify the resource profiles. If I can sum the FLIP and previous discussion up in my own words, the problem is the following:
Operator-level specification is the simplest and cleanest approach, because it avoids mixing operator configuration (resources) and scheduling. No matter what other parameters change (chaining, slot sharing, switching between pipelined and blocking shuffles), the resource profiles stay the same.
But it would require that a user specifies resources on all operators, which makes it hard to use. That's why the FLIP suggests going with specifying resources on a Sharing-Group.

I think both thoughts are important, so can we find a solution where the Resource Profiles are specified on an Operator, but we still avoid having to specify a resource profile on every operator?

What do you think about something like the following:
   - Resource Profiles are specified on an operator level.
   - Not all operators need profiles.
   - All Operators without a Resource Profile end up in the default slot sharing group with a default profile (will get a default slot).
   - All Operators with a Resource Profile will go into another slot sharing group (the resource-specified-group).
   - Users can define different slot sharing groups for operators like they do now, with the exception that you cannot mix operators that have a resource profile and operators that have no resource profile.
   - The default case where no operator has a resource profile is just a special case of this model.
   - The chaining logic sums up the profiles per operator, like it does now, and the scheduler sums up the profiles of the tasks that it schedules together (a minimal sketch of this summing follows below).
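
Roughly, with a simplified stand-in value class instead of Flink's real resource spec types (names below are made up for illustration):

import java.util.List;

// Illustrative only: the profile of a chained task is the sum of its operators'
// profiles; the scheduler can then sum task profiles again for whatever it
// schedules together.
final class ChainedTaskResources {
    final double cpuCores;
    final long taskHeapBytes;

    ChainedTaskResources(double cpuCores, long taskHeapBytes) {
        this.cpuCores = cpuCores;
        this.taskHeapBytes = taskHeapBytes;
    }

    ChainedTaskResources merge(ChainedTaskResources other) {
        return new ChainedTaskResources(cpuCores + other.cpuCores,
                                        taskHeapBytes + other.taskHeapBytes);
    }

    static ChainedTaskResources sumOf(List<ChainedTaskResources> operatorProfiles) {
        return operatorProfiles.stream()
                .reduce(new ChainedTaskResources(0, 0), ChainedTaskResources::merge);
    }
}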


There is another question about reactive scaling raised in the FLIP. I need to think a bit about that. That is indeed a bit more tricky once we have slots of different sizes.
It is not clear then which of the different slot requests the ResourceManager should fulfill when new resources (TMs) show up, or how the JobManager redistributes the slot resources when resources (TMs) disappear.
This question is pretty orthogonal, though, to the "how to specify the resources".


Best,
Stephan

On Fri, Jan 8, 2021 at 5:14 AM Xintong Song <tonysong...@gmail.com> wrote:
Thanks for drafting the FLIP and driving the discussion, Yangze.
And thanks for the feedback, Till and Chesnay.

@Till,

I agree that specifying requirements for SSGs means that SSGs need to be supported in fine-grained resource management, otherwise each operator might use as many resources as the whole group. However, I cannot think of a strong reason for not supporting SSGs in fine-grained resource management.


Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually.

So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.

Couldn't agree more that if all operators' requirements are properly specified, slot sharing should no longer be needed. I think this exactly disproves the example. If we already know op_1 and op_2 each need 100 MB of memory, why would we put them in the same group? If they are in separate groups, with the proposed approach the system can freely deploy them to either a 200 MB TM or two 100 MB TMs.

Moreover, the precondition for not needing slot sharing is having resource requirements properly specified for all operators. This is not always possible, and usually requires tremendous effort. One of the benefits of SSG-based requirements is that it allows the user to freely decide the granularity, and thus the effort they want to invest. I would consider an SSG in fine-grained resource management as a group of operators that the user would like to specify the total resources for. There can be only one group in the job, 2~3 groups dividing the job into a few major parts, or as many groups as the number of tasks/operators, depending on how fine-grained the user is able to specify the resources.

Having to support SSGs might be a constraint. But given that all the current scheduler implementations already support SSGs, I tend to see that as an acceptable price for the above discussed usability and flexibility.

@Chesnay

Will declaring them on slot sharing groups not also waste resources if the parallelisms of the operators within that group differ?

Yes. It's a trade-off between usability and resource utilization. To avoid such waste, the user can define more groups, so that each group contains fewer operators and the chance of having operators with different parallelisms is reduced. The price is having more resource requirements to specify.

It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resource requirements in such a setting would be a nightmare, and in the end would require operator-level requirements anyway.
In that sense, I'm not even sure whether it really increases usability.
    - As mentioned in my reply to Till's comment, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management.
    - Even if an operator implementation is reused for multiple applications, that does not guarantee the same resource requirements. During our years of practice in Alibaba, with per-operator requirements specified for Blink's fine-grained resource management, very few users (including our specialists who are dedicated to supporting Blink users) are experienced enough to accurately predict/estimate the operator resource requirements. Most people rely on execution-time metrics (throughput, delay, CPU load, memory usage, GC pressure, etc.) to improve the specification.

To sum up:
If the user is capable of providing proper resource requirements for every operator, that's definitely a good thing and we would not need to rely on the SSGs. However, that shouldn't be a *must* for fine-grained resource management to work. For those users who are capable and do not like having to set each operator to a separate SSG, I would be OK to have both SSG-based and operator-based runtime interfaces and to only fall back to the SSG requirements when the operator requirements are not specified. However, as the first step, I think we should prioritise the use cases where users are not that experienced.

Thank you~

Xintong Song

On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler <ches...@apache.org> wrote:

Will declaring them on slot sharing groups not also waste resources if the parallelisms of the operators within that group differ?

It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resource requirements in such a setting would be a nightmare, and in the end would require operator-level requirements anyway.
In that sense, I'm not even sure whether it really increases usability.
My main worry is that if we wire the runtime to work on SSGs it's gonna be difficult to implement more fine-grained approaches, which would not be the case if, for the runtime, they are always defined on an operator level.

On 1/7/2021 2:42 PM, Till Rohrmann wrote:
Thanks for drafting this FLIP and starting this discussion, Yangze.

I like that defining resource requirements on a slot sharing group makes the overall setup easier and improves the usability of resource requirements. What I do not like about it is that it changes slot sharing groups from being a scheduling hint to something which needs to be supported in order to support fine-grained resource requirements. So far, the idea of slot sharing groups was that it tells the system that a set of operators can be deployed in the same slot. But the system still had the freedom to say that it would rather place these tasks in different slots if it wanted. If we now specify resource requirements per slot sharing group, then the only option for a scheduler which does not support slot sharing groups is to say that every operator in this slot sharing group needs a slot with the same resources as the whole group.

So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.

Originally, one of the primary goals of slot sharing groups was to make it easier for the user to reason about how many slots a job needs independent of the actual number of operators in the job. Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually. What matters is whether the whole cluster has enough resources to run all tasks or not.

Cheers,
Till

On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo <karma...@gmail.com> wrote:
Hi, there,

We would like to start a discussion thread on "FLIP-156: Runtime Interfaces for Fine-Grained Resource Requirements"[1], where we propose Slot Sharing Group (SSG) based runtime interfaces for specifying fine-grained resource requirements.

In this FLIP:
- Expound the user story of fine-grained resource management.
- Propose runtime interfaces for specifying SSG-based resource requirements.
- Discuss the pros and cons of the three potential granularities for specifying the resource requirements (op, task and slot sharing group) and explain why we choose the slot sharing group.

Please find more details in the FLIP wiki document [1]. Looking forward to your feedback.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements

Best,
Yangze Guo


