One thing you could potentially look into is integrating the TaskExecutorManager into the FIneGrainedSlotManager or find another way to move the timeout handling into another component.

w.r.t. testing, there's the DeclarativeSlotManagerTest that covers some specific issues that should be ported to the fine grained slot manager.

On 01/03/2023 07:17, Xintong Song wrote:
Thanks Weihua for preparing this FLIP. +1 for the proposal.


As one of the contributors of the fine-grained slot manager, I'd like to
share some backgrounds here.

- There used to be a defaut slot manager implementation, which is
non-declarative and has been removed now. The two features, declarative /
reactive resource management and fine-grained resource management, were
proposed at about the same time. We were aware that by design the
declarative slot manager is a subset of fine-grained slot manager at that
time, but still decided to implement two slot managers for the purpose of
decoupling efforts and reducing cross-team synchronization overhead.
Merging the two slot managers once they are proved stable is IMO a
technical debt that was planned at the very beginning.

- The FineGrainedSlotManager has been verified in Alibaba's internal
production as well as Alibaba Cloud services as the default slot manager
for about 2 years.


Concerning test cases, we currently have a ci stage for fine grained
resource management. To avoid adding too much burden, the stage only
includes tests from flink-runtime and flink-test modules. I think switching
the default slot manager and applying the whole set of tests on the
fine-grained slot manager would help us to be more confident about it.


Concerning cutting out functionalities out of slot manager, I think Yangze
and I have tried our best to shape the FineGrainedSlotManager into
reasonable components. I personally don't have other ideas to further
disassemble the component, but I'm open to such suggestions. However, from
the stability perspective, I'd be in favor of not introducing significant
changes to the FineGrainedSlotManager while switching it to the default.
Because the current implementation has already been verified (or at least
partially verified because Alibaba does not cover all the Flink use cases),
and introducing more changes also means more chances of breaking things.


Best,

Xintong



On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zjur...@gmail.com> wrote:

Hi

Thanks for starting this work weihua, I think unifying
DeclarativeSlotManager and FineGrainedSlotManager is valuable.

I agree with @Matthias and @John that we need a way to ensure that
DeclarativeSlotManager's capabilities are fully covered by
FineGrainedSlotManager

1. For their functional differences, can you give some detailed tests to
verify that the new FineGrainedSlotManager has these capabilities? This can
effectively verify the new functions

2. I'm worried that many functions are not independent and it is difficult
to migrate step-by-step. You can list the relationship between them in
detail.

3. As John mentioned, give a smoke test for FineGrainedSlotManager is a
good idea. Or you can add some test information to the
DeclarativeSlotManager to determine how many tests have used it. In this
way, we can gradually construct test cases for FineGrainedSlotManager
during the development process.


Best,
Shammon


On Tue, Feb 28, 2023 at 10:22 PM John Roesler <vvcep...@apache.org> wrote:

Thanks for the FLIP, Weihua!

I’ve read the FLIP, and it sounds good to me. We need to avoid
proliferating alternative implementations wherever possible. I have just
a
couple of comments:

1. I share Matthias’s concern about ensuring the behavior is really the
same. One suggestion I’ve used for this kind of thing is, as a smoke
test,
to update the DeclarativeSlotManager to just delegate to the
FineGrainedSlotManager. If the full test suite still passes, you can be
pretty sure the new default is really ok. It would not be a good idea to
actually keep that in for the release, since it would remove the option
to
fall back in case of bugs. Either way, we need to make sure all test
scenarios are present for the FGSM.

4. In addition to changing the default, would it make sense to log a
deprecation warning on initialization if the DeclarativeSlotManager is
used?
Thanks again,
John

On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
Hi Weihua,
Thanks for your proposal. From a conceptual point: AFAIU, the
DeclarativeSlotManager covers a subset (i.e. only evenly sized slots)
of
what the FineGrainedSlotManager should be able to achieve (variable
slot
size per task manager). Is this the right assumption/understanding? In
this
sense, merging both implementations into a single one sounds good. A
few
more general comments, though:

1. Did you do a proper test coverage analysis? That's not mentioned in
the
current version of the FLIP. I'm bringing this up because we ran into
the
same issue when fixing the flaws that popped up after introducing the
multi-component leader election (see FLIP-285 [1]). There is a risk
that
by
removing the legacy code we decrease test coverage because certain
test cases that were covered for the legacy classes might not be
necessarily covered in the new implementation, yet (see FLINK-30338 [2]
which covers this issue for the leader election case). Ideally, we
don't
want to remove test cases accidentally because they were only
implemented
for the DeclarativeSlotManager but missed for the
FineGrainedSlotManager.
2. DeclarativeSlotManager and FineGrainedSlotManager feel quite big in
terms of lines of code. Without knowing whether it's actually a
reasonable
thing to do: Instead of just adding more features to the
FineGrainedSlotManager, have you thought of cutting out functionality
into
smaller sub-components along this refactoring? Such a step-by-step
approach
might improve the overall codebase and might make reviewing the
refactoring
easier. I did a first pass over the code and struggled to identify code
blocks that could be moved out of the SlotManager implementation(s).
Therefore, I might be wrong with this proposal. I haven't worked on
this
codebase in that detail that it would allow me to come up with a
judgement
call. I wanted to bring it up, anyway, because I'm curious whether that
could be an option. There's a comment created by Chesnay (CC'd) in the
JavaDoc of TaskExecutorManager [3] indicating something similar. I'm
wondering whether he can add some insights here.

3. For me personally, having a more detailed summary comparing the
subcomponents of both SlotManager implementations with where
their functionality matches and where they differ might help understand
the
consequences of the changes proposed in FLIP-298.

Best,
Matthias

[1]

https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
[2] https://issues.apache.org/jira/browse/FLINK-30338
[3]

https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wang...@163.com> wrote:

This is a good proposal for me, it will make the code of the
SlotManager
more clear.



--

Best,
Matt Wang


---- Replied Message ----
| From | David Morávek<d...@apache.org> |
| Date | 02/27/2023 22:45 |
| To | <dev@flink.apache.org> |
| Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation of
SlotManager |
Hi Weihua, I still need to dig into the details, but the overall
sentiment
of this change sounds reasonable.

Best,
D.

On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
zhanghao.c...@outlook.com>
wrote:

Thanks for driving this topic. I think this FLIP could help clean up
the
codebase to make it easier to maintain. +1 on it.

Best,
Zhanghao Chen
________________________________
From: Weihua Hu <huweihua....@gmail.com>
Sent: Monday, February 27, 2023 20:40
To: dev <dev@flink.apache.org>
Subject: [DISCUSS] FLIP-298: Unifying the Implementation of
SlotManager
Hi everyone,

I would like to begin a discussion on FLIP-298: Unifying the
Implementation
of SlotManager[1]. There are currently two types of SlotManager in
Flink:
DeclarativeSlotManager and FineGrainedSlotManager.
FineGrainedSlotManager
should behave as DeclarativeSlotManager if the user does not configure
the
slot request profile.

Therefore, this FLIP aims to unify the implementation of SlotManager
in
order to reduce maintenance costs.

Looking forward to hearing from you.

[1]



https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
Best,
Weihua



Reply via email to