Will declaring them on slot sharing groups not also waste resources if
the parallelism of the operators within that group differs?
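To make that concern concrete, here is a rough sketch of the arithmetic. It assumes a simplified model in which a group needs as many slots as its highest operator parallelism and every slot is sized for one subtask of every operator in the group. The function name and the numbers are hypothetical and not taken from the FLIP.

```python
def wasted_memory_mb(operators):
    """Estimate reserved-but-unused memory for one slot sharing group.

    operators: list of (memory_mb_per_subtask, parallelism) tuples.
    Assumed model (hypothetical): the group needs max(parallelism) slots,
    and each slot is sized for one subtask of every operator.
    """
    slots = max(parallelism for _, parallelism in operators)
    slot_size = sum(mem for mem, _ in operators)
    reserved = slots * slot_size
    used = sum(mem * parallelism for mem, parallelism in operators)
    return reserved - used


# op_a: 100 MB per subtask, parallelism 4
# op_b: 300 MB per subtask, parallelism 2
print(wasted_memory_mb([(100, 4), (300, 2)]))  # 600
```

Under this model, two of the four slots reserve 300 MB each for an op_b subtask that never runs there. With equal parallelism the waste is zero.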
It also seems like quite a hassle for users having to recalculate the
resource requirements if they change the slot sharing.
I'd think that it's not really workable for users that create a set of
re-usable operators which are mixed and matched in their applications;
managing the resource requirements in such a setting would be a
nightmare, and in the end would require operator-level requirements anyway.
In that sense, I'm not even sure whether it really increases usability.
My main worry is that if we wire the runtime to work on SSGs, it's
gonna be difficult to implement more fine-grained approaches, which
would not be the case if, for the runtime, they are always defined on an
operator-level.
On 1/7/2021 2:42 PM, Till Rohrmann wrote:
Thanks for drafting this FLIP and starting this discussion Yangze.
I like that defining resource requirements on a slot sharing group makes
the overall setup easier and improves usability of resource requirements.
What I do not like about it is that it changes slot sharing groups from
being a scheduling hint to something which needs to be supported in order
to support fine grained resource requirements. So far, the idea of slot
sharing groups was that it tells the system that a set of operators can be
deployed in the same slot. But the system still had the freedom to say that
it would rather place these tasks in different slots if it wanted. If we
now specify resource requirements per slot sharing group, then the
only option for a scheduler which does not support slot sharing groups is
to say that every operator in this slot sharing group needs a slot with the
same resources as the whole group.
So for example, if we have a job consisting of two operators, op_1 and op_2,
where each op needs 100 MB of memory, we would then say that the slot
sharing group needs 200 MB of memory to run. If we have a cluster with 2
TMs with one slot of 100 MB each, then the system cannot run this job. If
the resources were specified on an operator level, then the system could
still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
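The example above can be checked with a small sketch. It assumes fixed-size slots and that every request needs a slot of its own; `can_place` is a hypothetical helper for illustration, not a Flink API.

```python
def can_place(requests_mb, slots_mb):
    """Return True if every request can get its own slot that is big enough.

    Largest requests are served first, each taking the smallest slot
    that still fits it.
    """
    free = sorted(slots_mb)
    for need in sorted(requests_mb, reverse=True):
        fit = next((slot for slot in free if slot >= need), None)
        if fit is None:
            return False
        free.remove(fit)
    return True


slots = [100, 100]  # TM_1 and TM_2 with one 100 MB slot each

# Requirement declared on the slot sharing group: one 200 MB request.
print(can_place([200], slots))       # False

# Requirements declared per operator: op_1 and op_2 need 100 MB each.
print(can_place([100, 100], slots))  # True
```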
Originally, one of the primary goals of slot sharing groups was to make it
easier for the user to reason about how many slots a job needs independent
of the actual number of operators in the job. Interestingly, if all
operators have their resources properly specified, then slot sharing is no
longer needed because Flink could slice off the appropriately sized slots
for every Task individually. What matters is whether the whole cluster has
enough resources to run all tasks or not.
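As a rough illustration of slicing off one appropriately sized slot per task, the sketch below uses a simple first-fit strategy over the free memory of each TM. The helper name, the extra task op_3 and the TM sizes are assumptions for illustration only; this is not how the Flink scheduler is implemented.

```python
def slice_slots(task_mb, tm_free_mb):
    """First-fit sketch: carve one exactly sized slot per task out of TM memory.

    task_mb:    {task_name: required_mb}
    tm_free_mb: {tm_name: free_mb}
    Returns {task_name: tm_name}, or None if some task cannot be placed.
    """
    free = dict(tm_free_mb)  # copy so the caller's view is not modified
    placement = {}
    for task, need in task_mb.items():
        tm = next((name for name, mb in free.items() if mb >= need), None)
        if tm is None:
            return None
        free[tm] -= need
        placement[task] = tm
    return placement


tasks = {"op_1": 100, "op_2": 100, "op_3": 50}
tms = {"TM_1": 150, "TM_2": 100}
print(slice_slots(tasks, tms))
# {'op_1': 'TM_1', 'op_2': 'TM_2', 'op_3': 'TM_1'}
```

Note that sufficient total cluster memory is necessary but not always enough: a task still has to fit on a single TM, so fragmentation can make a placement fail even when the sum of free memory would suffice.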
Cheers,
Till
On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo <karma...@gmail.com> wrote:
Hi, there,
We would like to start a discussion thread on "FLIP-156: Runtime
Interfaces for Fine-Grained Resource Requirements"[1], where we
propose Slot Sharing Group (SSG) based runtime interfaces for
specifying fine-grained resource requirements.
In this FLIP, we:
- Expound the user story of fine-grained resource management.
- Propose runtime interfaces for specifying SSG-based resource
requirements.
- Discuss the pros and cons of the three potential granularities for
specifying the resource requirements (op, task and slot sharing group)
and explain why we choose the slot sharing group.
Please find more details in the FLIP wiki document [1]. Looking
forward to your feedback.
[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
Best,
Yangze Guo