Is there even a functional difference between specifying the requirements for an SSG vs specifying the same requirements on a single operator within that group (ideally a colocation group to avoid this whole hint business)?

Wouldn't we get the best of both worlds in the latter case?

Users can take shortcuts to define shared requirements, but refine them further as needed on a per-operator basis, without changing the semantics of slot sharing groups and without locking the runtime into SSG-based requirements.

(And before anyone argues what happens if slot sharing groups change or whatnot, that's a plain API issue that we could surely solve; a simple iteration over the slot sharing groups and the operators they contain would suffice.)

On 1/20/2021 6:48 PM, Till Rohrmann wrote:
Maybe a different minor idea: Would it be possible to treat the SSG
resource requirements as a hint for the runtime similar to how slot sharing
groups are designed at the moment? Meaning that we don't give the guarantee
that Flink will always deploy this set of tasks together no matter what
comes. If, for example, the runtime can derive by some means the resource
requirements for each task based on the requirements for the SSG, this
could be possible. One easy strategy would be to give every task the same
resources as the whole slot sharing group. Another one could be
distributing the resources equally among the tasks. This does not even have
to be implemented but we would give ourselves the freedom to change
scheduling if need should arise.
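
To make those two strategies concrete, here is a minimal, self-contained sketch; TaskResources is a made-up value class for illustration only, not Flink's actual ResourceProfile/ResourceSpec API.

// Illustrative only: two ways the runtime could derive per-task requirements
// from an SSG-level requirement.
final class TaskResources {
    final double cpuCores;
    final long taskHeapBytes;

    TaskResources(double cpuCores, long taskHeapBytes) {
        this.cpuCores = cpuCores;
        this.taskHeapBytes = taskHeapBytes;
    }

    // Strategy 1: every task gets the same resources as the whole slot sharing group.
    static TaskResources sameAsGroup(TaskResources group) {
        return group;
    }

    // Strategy 2: distribute the group's resources equally among its tasks.
    static TaskResources equalShare(TaskResources group, int numTasksInGroup) {
        return new TaskResources(
                group.cpuCores / numTasksInGroup,
                group.taskHeapBytes / numTasksInGroup);
    }
}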

Cheers,
Till

On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo <karma...@gmail.com> wrote:

Thanks for the responses, Till and Xintong.

I second Xintong's comment that the SSG-based runtime interface will give
us the flexibility to achieve an op/task-based approach. That's one of
the most important reasons for our design choice.

A few cents regarding the default operator resource:
- It might be good for the scenario of DataStream jobs.
    ** For lightweight operators, the accumulated configuration error will not be significant, and the resources a task uses will be roughly proportional to the number of operators it contains.
    ** For heavy operators like join and window, or operators using external resources, users will turn to the fine-grained resource configuration.
- It can increase stability for standalone clusters where the registered task executors are heterogeneous (with different default slot resources).
- It might not be good for SQL users. The operators that SQL is translated into are a black box to the user, and so far we also do not guarantee cross-version consistency of that translation.

I think it can be treated as follow-up work once fine-grained resource management is ready end-to-end.

Best,
Yangze Guo


On Wed, Jan 20, 2021 at 11:16 AM Xintong Song <tonysong...@gmail.com> wrote:
Thanks for the feedback, Till.

## I feel that what you proposed (operator-based + default value) might be subsumed by the SSG-based approach.
Thinking of op_1 -> op_2, there are the following 4 cases, categorized by whether the resource requirements are known to the users.

    1. *Both known.* As previously mentioned, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management. And if op_1 and op_2 are in different groups, there should be no problem switching the data exchange mode from pipelined to blocking. This is equivalent to specifying operator resource requirements in your proposal.
    2. *op_1 known, op_2 unknown.* Similar to 1), except that op_2 is in an SSG whose resource is not specified and thus would get the default slot resource. This is equivalent to having default operator resources in your proposal.
    3. *Both unknown.* The user can either set op_1 and op_2 to the same SSG or to separate SSGs.
       - If op_1 and op_2 are in the same SSG, it will be equivalent to coarse-grained resource management, where op_1 and op_2 share a default-size slot no matter which data exchange mode is used.
       - If op_1 and op_2 are in different SSGs, then each of them will use a default-size slot. This is equivalent to setting them with default operator resources in your proposal.
    4. *Total (pipelined) or max (blocking) of op_1 and op_2 is known.*
       - It is possible that the user learns the total / max resource requirement from executing and monitoring the job, while not being aware of the individual operator requirements.
       - I believe this is the case your proposal does not cover. And TBH, this is probably how most users learn the resource requirements, according to my experience.
       - In this case, the user might need to specify different resources when switching the execution mode, which should not be worse than not being able to use fine-grained resource management at all.


## An additional idea inspired by your proposal.
We may provide multiple options for deciding resources for SSGs whose requirement is not specified, if needed.

    - Default slot resource (current design)
    - Default operator resource times number of operators (equivalent to your proposal)


## Exposing internal runtime strategies
Theoretically, yes. Being tied to SSGs, the resource requirements might be affected if how SSGs are internally handled changes in the future. Practically, I do not concretely see at the moment what kind of future changes might conflict with this FLIP proposal, as shown by the question of switching the data exchange mode answered above. I'd suggest not giving up the user friendliness we can gain now for future problems that may or may not exist.

Moreover, the SSG-based approach has the flexibility to achieve behavior equivalent to the operator-based approach, if we set each operator (or task) to a separate SSG. We can even provide a shortcut option that does this automatically for users, if needed.


Thank you~

Xintong Song



On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann <trohrm...@apache.org> wrote:
Thanks for the responses Xintong and Stephan,

I agree that being able to define the resource requirements for a group of operators is more user friendly. However, my concern is that we would thereby expose internal runtime strategies which might limit our flexibility to execute a given job. Moreover, the semantics of configuring resource requirements for SSGs could break when switching from streaming to batch execution. If one defines the resource requirements for op_1 -> op_2, which run in pipelined mode when using streaming execution, then how do we interpret these requirements when op_1 -> op_2 are executed with a blocking data exchange in batch execution mode? Consequently, I am still leaning towards Stephan's proposal to set the resource requirements per operator.

Maybe the following proposal makes the configuration easier: If the user wants to use fine-grained resource requirements, then she needs to specify the default size to be used for operators which have no explicit resource annotation. If this holds true, then every operator would have a resource requirement and the system could try to execute the operators in the best possible manner w/o being constrained by how the user set the SSG requirements.
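
A minimal sketch of that resolution logic, using made-up types (not any existing Flink API): operators without an explicit annotation fall back to the user-defined default, so every operator ends up with a concrete requirement that the scheduler is free to place anywhere.

import java.util.List;
import java.util.Optional;

// Illustrative only (made-up types): explicit annotation wins, otherwise the default applies.
final class OperatorResourceDefaults {
    record Resources(double cpuCores, long taskHeapBytes) {}
    record Operator(String name, Optional<Resources> explicitResources) {}

    static Resources resolve(Operator op, Resources defaultResources) {
        return op.explicitResources().orElse(defaultResources);
    }

    public static void main(String[] args) {
        Resources defaults = new Resources(0.5, 128L * 1024 * 1024);
        List<Operator> ops = List.of(
                new Operator("source", Optional.empty()),      // no annotation -> default
                new Operator("heavy-join", Optional.of(new Resources(2.0, 1024L * 1024 * 1024))),
                new Operator("sink", Optional.empty()));       // no annotation -> default

        ops.forEach(op -> System.out.println(op.name() + " -> " + resolve(op, defaults)));
    }
}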

Cheers,
Till

On Tue, Jan 19, 2021 at 9:09 AM Xintong Song <tonysong...@gmail.com> wrote:

Thanks for the feedback, Stephan.

Actually, your proposal had also come to my mind at some point, and I have some concerns about it.


1. It does not give users the same control as the SSG-based approach.


While neither approach requires specifying resources for each operator, the SSG-based approach supports the semantic that "some operators together use this much resource", while the operator-based approach doesn't.


Think of a long pipeline with m operators (o_1, o_2, ..., o_m), and at some point there's an agg o_n (1 < n < m) which significantly reduces the data amount. One can separate the pipeline into 2 groups, SSG_1 (o_1, ..., o_n) and SSG_2 (o_n+1, ..., o_m), so that configuring much higher parallelisms for operators in SSG_1 than for operators in SSG_2 won't lead to too much waste of resources. If the two SSGs end up needing different resources, with the SSG-based approach one can directly specify resources for the two groups. However, with the operator-based approach, the user would have to specify resources for each operator in one of the two groups, and tune the default slot resource via configurations to fit the other group.
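
For illustration, declaring total requirements for the two groups in this example could look roughly like the following; the types here are simplified stand-ins, not the FLIP's actual runtime interfaces, and the numbers are arbitrary.

import java.util.List;
import java.util.Map;

// Illustrative only (made-up types): one total requirement per slot sharing group,
// matching the SSG_1 / SSG_2 split described above.
final class SsgRequirementsExample {
    record GroupRequirement(double cpuCores, int taskHeapMb, List<String> operators) {}

    public static void main(String[] args) {
        Map<String, GroupRequirement> requirements = Map.of(
                // o_1 ... o_n: high-parallelism part before the agg, needs most of the resources
                "SSG_1", new GroupRequirement(4.0, 4096, List.of("o_1", "o_2", "o_n")),
                // o_n+1 ... o_m: low-parallelism part after the agg, much lighter
                "SSG_2", new GroupRequirement(1.0, 512, List.of("o_n+1", "o_m")));

        requirements.forEach((group, req) ->
                System.out.printf("%s -> %.1f cores, %d MB task heap, operators %s%n",
                        group, req.cpuCores(), req.taskHeapMb(), req.operators()));
    }
}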


2. It increases the chance of breaking operator chains.


Setting chainable operators into different slot sharing groups will prevent them from being chained. In the current implementation, downstream operators, if their SSG is not explicitly specified, will be set to the same group as their chainable upstream operators (unless there are multiple upstream operators in different groups), to reduce the chance of breaking chains.


Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4, if we decide SSGs based on whether resources are specified, we could easily get groups like (o_1, o_3) & (o_2, o_4), where none of the operators can be chained. This is also possible for the SSG-based approach, but I believe the chance is much smaller because there's no strong reason for users to specify groups with alternating operators like that. We are more likely to get groups like (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 and o_3.

3. It complicates the system by having two different mechanisms for sharing managed memory in a slot.


- In FLIP-141, we introduced the intra-slot managed memory sharing mechanism, where managed memory is first distributed according to the consumer type, then further distributed across operators of that consumer type.

- With the operator-based approach, managed memory size specified for an operator should account for all the consumer types of that operator. That means the managed memory is first distributed across operators, then distributed to different consumer types of each operator.


Unfortunately, the different order of the two calculation steps can lead to different results. To be specific, the semantics of the configuration option `consumer-weights` would change (within a slot vs. within an operator).
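
For a made-up illustration of why the order matters (assuming, purely for this example, that a consumer type's share is split evenly across the operators using it): take weights OPERATOR:2, PYTHON:1, a slot with 90 MB of managed memory, op_A using both consumer types and op_B using only OPERATOR. Splitting by type first gives OPERATOR 60 MB and PYTHON 30 MB, so op_A ends up with 30 + 30 = 60 MB and op_B with 30 MB. Splitting by operator first (45 MB each) gives op_A 30 + 15 MB and op_B 45 MB, i.e. OPERATOR 75 MB and PYTHON 15 MB in total, which is a different outcome from the same weights.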



To sum up things:

While (3) might be a bit more implementation related, I think (1) and (2) somehow suggest that the price for the proposed approach to avoid specifying resources for every operator is that it's not as independent from operator chaining and slot sharing as the operator-based approach discussed in the FLIP.


Thank you~

Xintong Song



On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <se...@apache.org> wrote:
Thanks a lot, Yangze and Xintong for this FLIP.

I want to say, first of all, that this is super well written. And the points that the FLIP makes about how to expose the configuration to users are exactly the right thing to figure out first.
So good job here!

About how to let users specify the resource profiles. If I can sum the FLIP and previous discussion up in my own words, the problem is the following:
Operator-level specification is the simplest and cleanest approach, because it avoids mixing operator configuration (resources) and scheduling. No matter what other parameters change (chaining, slot sharing, switching between pipelined and blocking shuffles), the resource profiles stay the same.
But it would require that a user specifies resources on all operators, which makes it hard to use. That's why the FLIP suggests going with specifying resources on a Sharing-Group.

I think both thoughts are important, so can we find a solution where the Resource Profiles are specified on an Operator, but we still avoid having to specify a resource profile on every operator?

What do you think about something like the following:
   - Resource Profiles are specified on an operator level.
   - Not all operators need profiles.
   - All Operators without a Resource Profile end up in the default slot sharing group with a default profile (will get a default slot).
   - All Operators with a Resource Profile will go into another slot sharing group (the resource-specified-group).
   - Users can define different slot sharing groups for operators like they do now, with the exception that you cannot mix operators that have a resource profile and operators that have no resource profile.
   - The default case where no operator has a resource profile is just a special case of this model.
   - The chaining logic sums up the profiles per operator, like it does now, and the scheduler sums up the profiles of the tasks that it schedules together (a minimal sketch of this summing follows below).
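
Roughly, with a simplified stand-in value class instead of Flink's real resource spec types (names below are made up for illustration):

import java.util.List;

// Illustrative only: the profile of a chained task is the sum of its operators'
// profiles; the scheduler can then sum task profiles again for whatever it
// schedules together.
final class ChainedTaskResources {
    final double cpuCores;
    final long taskHeapBytes;

    ChainedTaskResources(double cpuCores, long taskHeapBytes) {
        this.cpuCores = cpuCores;
        this.taskHeapBytes = taskHeapBytes;
    }

    ChainedTaskResources merge(ChainedTaskResources other) {
        return new ChainedTaskResources(cpuCores + other.cpuCores,
                                        taskHeapBytes + other.taskHeapBytes);
    }

    static ChainedTaskResources sumOf(List<ChainedTaskResources> operatorProfiles) {
        return operatorProfiles.stream()
                .reduce(new ChainedTaskResources(0, 0), ChainedTaskResources::merge);
    }
}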


There is another question about reactive scaling raised in the FLIP. I need to think a bit about that. That is indeed a bit more tricky once we have slots of different sizes.
It is not clear then which of the different slot requests the ResourceManager should fulfill when new resources (TMs) show up, or how the JobManager redistributes the slot resources when resources (TMs) disappear.
This question is pretty orthogonal, though, to the "how to specify the resources".


Best,
Stephan

On Fri, Jan 8, 2021 at 5:14 AM Xintong Song <tonysong...@gmail.com> wrote:
Thanks for drafting the FLIP and driving the discussion, Yangze.
And thanks for the feedback, Till and Chesnay.

@Till,

I agree that specifying requirements for SSGs means that SSGs need to be supported in fine-grained resource management, otherwise each operator might use as many resources as the whole group. However, I cannot think of a strong reason for not supporting SSGs in fine-grained resource management.


Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually.

So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.

Couldn't agree more that if all operators' requirements are properly specified, slot sharing should no longer be needed. I think this exactly disproves the example. If we already know op_1 and op_2 each need 100 MB of memory, why would we put them in the same group? If they are in separate groups, with the proposed approach the system can freely deploy them to either a 200 MB TM or two 100 MB TMs.

Moreover, the precondition for not needing slot sharing is having resource requirements properly specified for all operators. This is not always possible, and usually requires tremendous effort. One of the benefits of SSG-based requirements is that it allows the user to freely decide the granularity, and thus the effort they want to invest. I would consider an SSG in fine-grained resource management as a group of operators that the user would like to specify the total resources for. There can be only one group in the job, 2~3 groups dividing the job into a few major parts, or as many groups as the number of tasks/operators, depending on how fine-grained the user is able to specify the resources.

Having to support SSGs might be a constraint. But given that all the current scheduler implementations already support SSGs, I tend to see that as an acceptable price for the above discussed usability and flexibility.

@Chesnay

Will declaring them on slot sharing groups not also waste resources if the parallelisms of the operators within that group differ?

Yes. It's a trade-off between usability and resource utilization. To avoid such waste, the user can define more groups, so that each group contains fewer operators and the chance of having operators with different parallelisms is reduced. The price is having more resource requirements to specify.

It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resource requirements in such a setting would be a nightmare, and in the end would require operator-level requirements anyway.
In that sense, I'm not even sure whether it really increases usability.
    - As mentioned in my reply to Till's comment, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management.
    - Even if an operator implementation is reused for multiple applications, that does not guarantee the same resource requirements. During our years of practice in Alibaba, with per-operator requirements specified for Blink's fine-grained resource management, very few users (including our specialists who are dedicated to supporting Blink users) are experienced enough to accurately predict/estimate the operator resource requirements. Most people rely on execution-time metrics (throughput, delay, CPU load, memory usage, GC pressure, etc.) to improve the specification.

To sum up:
If the user is capable of providing proper resource requirements for every operator, that's definitely a good thing and we would not need to rely on the SSGs. However, that shouldn't be a *must* for fine-grained resource management to work. For those users who are capable and do not like having to set each operator to a separate SSG, I would be OK to have both SSG-based and operator-based runtime interfaces and to only fall back to the SSG requirements when the operator requirements are not specified. However, as the first step, I think we should prioritise the use cases where users are not that experienced.

Thank you~

Xintong Song

On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler <ches...@apache.org> wrote:

Will declaring them on slot sharing groups not also waste resources if the parallelisms of the operators within that group differ?

It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resource requirements in such a setting would be a nightmare, and in the end would require operator-level requirements anyway.
In that sense, I'm not even sure whether it really increases usability.
My main worry is that if we wire the runtime to work on SSGs it's gonna be difficult to implement more fine-grained approaches, which would not be the case if, for the runtime, they are always defined on an operator level.

On 1/7/2021 2:42 PM, Till Rohrmann wrote:
Thanks for drafting this FLIP and starting this discussion, Yangze.

I like that defining resource requirements on a slot sharing group makes the overall setup easier and improves the usability of resource requirements. What I do not like about it is that it changes slot sharing groups from being a scheduling hint to something which needs to be supported in order to support fine-grained resource requirements. So far, the idea of slot sharing groups was that it tells the system that a set of operators can be deployed in the same slot. But the system still had the freedom to say that it would rather place these tasks in different slots if it wanted. If we now specify resource requirements per slot sharing group, then the only option for a scheduler which does not support slot sharing groups is to say that every operator in this slot sharing group needs a slot with the same resources as the whole group.

So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.

Originally, one of the primary goals of slot sharing groups was to make it easier for the user to reason about how many slots a job needs independent of the actual number of operators in the job. Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually. What matters is whether the whole cluster has enough resources to run all tasks or not.

Cheers,
Till

On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo <karma...@gmail.com> wrote:
Hi, there,

We would like to start a discussion thread on "FLIP-156: Runtime Interfaces for Fine-Grained Resource Requirements"[1], where we propose Slot Sharing Group (SSG) based runtime interfaces for specifying fine-grained resource requirements.

In this FLIP:
- Expound the user story of fine-grained resource management.
- Propose runtime interfaces for specifying SSG-based resource requirements.
- Discuss the pros and cons of the three potential granularities for specifying the resource requirements (op, task and slot sharing group) and explain why we choose the slot sharing group.

Please find more details in the FLIP wiki document [1]. Looking forward to your feedback.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements

Best,
Yangze Guo


