Thanks Weihua for preparing this FLIP.

This FLIP overall looks reasonable to me after updating as suggested by
Matthias.

I only have one small question about keeping some redundant task managers:
In the fine-grained resource management, theoretically, it can support
heterogeneous taskmanagers. When we complete the missing features for FGSM,
do we plan to take this into account?
Of course, if I remember correctly, FGSM will not request heterogeneous
resources at present, so it is also acceptable to me if there is no special
treatment now.

+1 for this changes if we can ensure the test coverage.

Best regards,

Weijie


John Roesler <vvcep...@apache.org> 于2023年3月2日周四 12:53写道:

> Thanks for the test plan, Weihua!
>
> Yes, it addresses my concerns.
>
> Thanks,
> John
>
> On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
> > Hi, everyone,
> > Thanks for your suggestions and ideas.
> > Thanks Xintong for sharing the detailed backgrounds of SlotManager.
> >
> > *@Matthias
> >
> > 1. Did you do a proper test coverage analysis?
> >
> >
> > Just as Xintong said, we already have a CI stage for fine grained
> resource
> > managers.
> > And I will make sure FineGrainedSlotManager as the default SlotManager
> can
> > pass all the tests of CI.
> > In addition, I will review all unit tests of DeclarativeSlotManager(DSM)
> to
> > ensure that there are no gaps in the
> > coverage provided by the FineGrainedSlotManager.
> > I also added the 'Test Plan' part to the FLIP.
> > @Matthias @John @Shammon Does this test plan address your concerns?
> >
> > 2.  DeclarativeSlotManager and FineGrainedSlotManager feel quite big in
> >
> > terms of lines of code....
> >
> >
> > IMO, the refactoring of SlotManager does not belong to this FLIP since it
> > may lead to some unstable risks. For
> > FineGrainedSlotManager(FGSM), we already split some reasonable
> components.
> > They are:
> > * TaskManagerTracker: Track task managers and their resources.
> > * ResourceTracker: track requirements of jobs
> > * ResourceAllocationStrategy: Try to fulfill the resource requirements
> with
> > available/pending resources.
> > * SlotStatusSyncer: communicate with TaskManager, for allocating/freeing
> > slot and reconciling the slot status
> > Maybe we can start a discussion about refactoring SlotManager in another
> > FLIP if there are some good suggestions.
> > WDYT
> >
> > 3. For me personally, having a more detailed summary comparing the
> >> subcomponents of both SlotManager implementations with where
> >> their functionality matches and where they differ might help understand
> the
> >> consequences of the changes proposed in FLIP-298
> >
> > Good suggestion, I have updated the comparison in this FLIP. Looking
> > forward to any suggestions/thoughts
> > if they are not described clearly.
> >
> > *@John
> >
> > 4. In addition to changing the default, would it make sense to log a
> >> deprecation warning on initialization
> >
> > if the DeclarativeSlotManager is used?
> >>
> > SGTM, We should add Deprecated annotations to DSM for devs. And log a
> > deprecation warning for users.
> >
> > *@Shammon
> >
> > 1. For their functional differences, can you give some detailed tests to
> >> verify that the new FineGrainedSlotManager has these capabilities? This
> can
> >> effectively verify the new functions
> >>
> > As just maintained, there is already a CI stage of FGSM, and I will do
> more
> > review of unit tests for DSM.
> >
> >  2. I'm worried that many functions are not independent and it is
> difficult
> >> to migrate step-by-step. You can list the relationship between them in
> >> detail.
> >
> >  As Xintong saied the DSM is a subset of FGSM by design. But as time goes
> > on, FGSM has some lacking
> > functions as I listed in this FLIP. And I have added the comparison
> between
> > DSM and FGSM in this FLIP.
> >
> >
> > Thanks again for all your thoughts. Any feedback is appreciated!
> >
> > Best,
> > Weihua
> >
> >
> > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <tonysong...@gmail.com>
> wrote:
> >
> >> Thanks Weihua for preparing this FLIP. +1 for the proposal.
> >>
> >>
> >> As one of the contributors of the fine-grained slot manager, I'd like to
> >> share some backgrounds here.
> >>
> >> - There used to be a defaut slot manager implementation, which is
> >> non-declarative and has been removed now. The two features, declarative
> /
> >> reactive resource management and fine-grained resource management, were
> >> proposed at about the same time. We were aware that by design the
> >> declarative slot manager is a subset of fine-grained slot manager at
> that
> >> time, but still decided to implement two slot managers for the purpose
> of
> >> decoupling efforts and reducing cross-team synchronization overhead.
> >> Merging the two slot managers once they are proved stable is IMO a
> >> technical debt that was planned at the very beginning.
> >>
> >> - The FineGrainedSlotManager has been verified in Alibaba's internal
> >> production as well as Alibaba Cloud services as the default slot manager
> >> for about 2 years.
> >>
> >>
> >> Concerning test cases, we currently have a ci stage for fine grained
> >> resource management. To avoid adding too much burden, the stage only
> >> includes tests from flink-runtime and flink-test modules. I think
> switching
> >> the default slot manager and applying the whole set of tests on the
> >> fine-grained slot manager would help us to be more confident about it.
> >>
> >>
> >> Concerning cutting out functionalities out of slot manager, I think
> Yangze
> >> and I have tried our best to shape the FineGrainedSlotManager into
> >> reasonable components. I personally don't have other ideas to further
> >> disassemble the component, but I'm open to such suggestions. However,
> from
> >> the stability perspective, I'd be in favor of not introducing
> significant
> >> changes to the FineGrainedSlotManager while switching it to the default.
> >> Because the current implementation has already been verified (or at
> least
> >> partially verified because Alibaba does not cover all the Flink use
> cases),
> >> and introducing more changes also means more chances of breaking things.
> >>
> >>
> >> Best,
> >>
> >> Xintong
> >>
> >>
> >>
> >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zjur...@gmail.com> wrote:
> >>
> >> > Hi
> >> >
> >> > Thanks for starting this work weihua, I think unifying
> >> > DeclarativeSlotManager and FineGrainedSlotManager is valuable.
> >> >
> >> > I agree with @Matthias and @John that we need a way to ensure that
> >> > DeclarativeSlotManager's capabilities are fully covered by
> >> > FineGrainedSlotManager
> >> >
> >> > 1. For their functional differences, can you give some detailed tests
> to
> >> > verify that the new FineGrainedSlotManager has these capabilities?
> This
> >> can
> >> > effectively verify the new functions
> >> >
> >> > 2. I'm worried that many functions are not independent and it is
> >> difficult
> >> > to migrate step-by-step. You can list the relationship between them in
> >> > detail.
> >> >
> >> > 3. As John mentioned, give a smoke test for FineGrainedSlotManager is
> a
> >> > good idea. Or you can add some test information to the
> >> > DeclarativeSlotManager to determine how many tests have used it. In
> this
> >> > way, we can gradually construct test cases for FineGrainedSlotManager
> >> > during the development process.
> >> >
> >> >
> >> > Best,
> >> > Shammon
> >> >
> >> >
> >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <vvcep...@apache.org>
> >> wrote:
> >> >
> >> > > Thanks for the FLIP, Weihua!
> >> > >
> >> > > I’ve read the FLIP, and it sounds good to me. We need to avoid
> >> > > proliferating alternative implementations wherever possible. I have
> >> just
> >> > a
> >> > > couple of comments:
> >> > >
> >> > > 1. I share Matthias’s concern about ensuring the behavior is really
> the
> >> > > same. One suggestion I’ve used for this kind of thing is, as a smoke
> >> > test,
> >> > > to update the DeclarativeSlotManager to just delegate to the
> >> > > FineGrainedSlotManager. If the full test suite still passes, you
> can be
> >> > > pretty sure the new default is really ok. It would not be a good
> idea
> >> to
> >> > > actually keep that in for the release, since it would remove the
> option
> >> > to
> >> > > fall back in case of bugs. Either way, we need to make sure all test
> >> > > scenarios are present for the FGSM.
> >> > >
> >> > > 4. In addition to changing the default, would it make sense to log a
> >> > > deprecation warning on initialization if the DeclarativeSlotManager
> is
> >> > used?
> >> > >
> >> > > Thanks again,
> >> > > John
> >> > >
> >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
> >> > > > Hi Weihua,
> >> > > > Thanks for your proposal. From a conceptual point: AFAIU, the
> >> > > > DeclarativeSlotManager covers a subset (i.e. only evenly sized
> slots)
> >> > of
> >> > > > what the FineGrainedSlotManager should be able to achieve
> (variable
> >> > slot
> >> > > > size per task manager). Is this the right
> assumption/understanding?
> >> In
> >> > > this
> >> > > > sense, merging both implementations into a single one sounds
> good. A
> >> > few
> >> > > > more general comments, though:
> >> > > >
> >> > > > 1. Did you do a proper test coverage analysis? That's not
> mentioned
> >> in
> >> > > the
> >> > > > current version of the FLIP. I'm bringing this up because we ran
> into
> >> > the
> >> > > > same issue when fixing the flaws that popped up after introducing
> the
> >> > > > multi-component leader election (see FLIP-285 [1]). There is a
> risk
> >> > that
> >> > > by
> >> > > > removing the legacy code we decrease test coverage because certain
> >> > > > test cases that were covered for the legacy classes might not be
> >> > > > necessarily covered in the new implementation, yet (see
> FLINK-30338
> >> [2]
> >> > > > which covers this issue for the leader election case). Ideally, we
> >> > don't
> >> > > > want to remove test cases accidentally because they were only
> >> > implemented
> >> > > > for the DeclarativeSlotManager but missed for the
> >> > FineGrainedSlotManager.
> >> > > >
> >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite
> big
> >> in
> >> > > > terms of lines of code. Without knowing whether it's actually a
> >> > > reasonable
> >> > > > thing to do: Instead of just adding more features to the
> >> > > > FineGrainedSlotManager, have you thought of cutting out
> functionality
> >> > > into
> >> > > > smaller sub-components along this refactoring? Such a step-by-step
> >> > > approach
> >> > > > might improve the overall codebase and might make reviewing the
> >> > > refactoring
> >> > > > easier. I did a first pass over the code and struggled to identify
> >> code
> >> > > > blocks that could be moved out of the SlotManager
> implementation(s).
> >> > > > Therefore, I might be wrong with this proposal. I haven't worked
> on
> >> > this
> >> > > > codebase in that detail that it would allow me to come up with a
> >> > > judgement
> >> > > > call. I wanted to bring it up, anyway, because I'm curious whether
> >> that
> >> > > > could be an option. There's a comment created by Chesnay (CC'd) in
> >> the
> >> > > > JavaDoc of TaskExecutorManager [3] indicating something similar.
> I'm
> >> > > > wondering whether he can add some insights here.
> >> > > >
> >> > > > 3. For me personally, having a more detailed summary comparing the
> >> > > > subcomponents of both SlotManager implementations with where
> >> > > > their functionality matches and where they differ might help
> >> understand
> >> > > the
> >> > > > consequences of the changes proposed in FLIP-298.
> >> > > >
> >> > > > Best,
> >> > > > Matthias
> >> > > >
> >> > > > [1]
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> >> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338
> >> > > > [3]
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> >> > > >
> >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wang...@163.com>
> wrote:
> >> > > >
> >> > > >> This is a good proposal for me, it will make the code of the
> >> > SlotManager
> >> > > >> more clear.
> >> > > >>
> >> > > >>
> >> > > >>
> >> > > >> --
> >> > > >>
> >> > > >> Best,
> >> > > >> Matt Wang
> >> > > >>
> >> > > >>
> >> > > >> ---- Replied Message ----
> >> > > >> | From | David Morávek<d...@apache.org> |
> >> > > >> | Date | 02/27/2023 22:45 |
> >> > > >> | To | <dev@flink.apache.org> |
> >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation
> of
> >> > > >> SlotManager |
> >> > > >> Hi Weihua, I still need to dig into the details, but the overall
> >> > > sentiment
> >> > > >> of this change sounds reasonable.
> >> > > >>
> >> > > >> Best,
> >> > > >> D.
> >> > > >>
> >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> >> > > zhanghao.c...@outlook.com>
> >> > > >> wrote:
> >> > > >>
> >> > > >> Thanks for driving this topic. I think this FLIP could help
> clean up
> >> > the
> >> > > >> codebase to make it easier to maintain. +1 on it.
> >> > > >>
> >> > > >> Best,
> >> > > >> Zhanghao Chen
> >> > > >> ________________________________
> >> > > >> From: Weihua Hu <huweihua....@gmail.com>
> >> > > >> Sent: Monday, February 27, 2023 20:40
> >> > > >> To: dev <dev@flink.apache.org>
> >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of
> >> > SlotManager
> >> > > >>
> >> > > >> Hi everyone,
> >> > > >>
> >> > > >> I would like to begin a discussion on FLIP-298: Unifying the
> >> > > Implementation
> >> > > >> of SlotManager[1]. There are currently two types of SlotManager
> in
> >> > > Flink:
> >> > > >> DeclarativeSlotManager and FineGrainedSlotManager.
> >> > > FineGrainedSlotManager
> >> > > >> should behave as DeclarativeSlotManager if the user does not
> >> configure
> >> > > the
> >> > > >> slot request profile.
> >> > > >>
> >> > > >> Therefore, this FLIP aims to unify the implementation of
> SlotManager
> >> > in
> >> > > >> order to reduce maintenance costs.
> >> > > >>
> >> > > >> Looking forward to hearing from you.
> >> > > >>
> >> > > >> [1]
> >> > > >>
> >> > > >>
> >> > > >>
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> >> > > >>
> >> > > >> Best,
> >> > > >> Weihua
> >> > > >>
> >> > > >>
> >> > >
> >> >
> >>
>

Reply via email to