Hi weihua

Can you add content related to `heterogeneous resources` to this FLIP? We
can record it and consider it in the future. It may be useful for some
scenarios, such as the combination of streaming and ML.

Best,
Shammon


On Mon, Mar 6, 2023 at 1:39 PM weijie guo <guoweijieres...@gmail.com> wrote:

> Hi Weihua,
>
> Thanks for your clarification, SGTM.
>
> Best regards,
>
> Weijie
>
>
> Weihua Hu <huweihua....@gmail.com> 于2023年3月6日周一 11:43写道:
>
> > Thanks Weijie.
> >
> > Heterogeneous task managers will not be considered in this FLIP since
> > it does not request heterogeneous resources as you said.
> >
> > My first thought is we can adjust the meaning of redundant configuration
> > to redundant number of per resource type. These can be considered in
> > detail when we decide to support heterogeneous task managers.
> >
> > Best,
> > Weihua
> >
> >
> > On Sat, Mar 4, 2023 at 1:13 AM weijie guo <guoweijieres...@gmail.com>
> > wrote:
> >
> > > Thanks Weihua for preparing this FLIP.
> > >
> > > This FLIP overall looks reasonable to me after updating as suggested by
> > > Matthias.
> > >
> > > I only have one small question about keeping some redundant task
> > managers:
> > > In the fine-grained resource management, theoretically, it can support
> > > heterogeneous taskmanagers. When we complete the missing features for
> > FGSM,
> > > do we plan to take this into account?
> > > Of course, if I remember correctly, FGSM will not request heterogeneous
> > > resources at present, so it is also acceptable to me if there is no
> > special
> > > treatment now.
> > >
> > > +1 for this changes if we can ensure the test coverage.
> > >
> > > Best regards,
> > >
> > > Weijie
> > >
> > >
> > > John Roesler <vvcep...@apache.org> 于2023年3月2日周四 12:53写道:
> > >
> > > > Thanks for the test plan, Weihua!
> > > >
> > > > Yes, it addresses my concerns.
> > > >
> > > > Thanks,
> > > > John
> > > >
> > > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
> > > > > Hi, everyone,
> > > > > Thanks for your suggestions and ideas.
> > > > > Thanks Xintong for sharing the detailed backgrounds of SlotManager.
> > > > >
> > > > > *@Matthias
> > > > >
> > > > > 1. Did you do a proper test coverage analysis?
> > > > >
> > > > >
> > > > > Just as Xintong said, we already have a CI stage for fine grained
> > > > resource
> > > > > managers.
> > > > > And I will make sure FineGrainedSlotManager as the default
> > SlotManager
> > > > can
> > > > > pass all the tests of CI.
> > > > > In addition, I will review all unit tests of
> > > DeclarativeSlotManager(DSM)
> > > > to
> > > > > ensure that there are no gaps in the
> > > > > coverage provided by the FineGrainedSlotManager.
> > > > > I also added the 'Test Plan' part to the FLIP.
> > > > > @Matthias @John @Shammon Does this test plan address your concerns?
> > > > >
> > > > > 2.  DeclarativeSlotManager and FineGrainedSlotManager feel quite
> big
> > in
> > > > >
> > > > > terms of lines of code....
> > > > >
> > > > >
> > > > > IMO, the refactoring of SlotManager does not belong to this FLIP
> > since
> > > it
> > > > > may lead to some unstable risks. For
> > > > > FineGrainedSlotManager(FGSM), we already split some reasonable
> > > > components.
> > > > > They are:
> > > > > * TaskManagerTracker: Track task managers and their resources.
> > > > > * ResourceTracker: track requirements of jobs
> > > > > * ResourceAllocationStrategy: Try to fulfill the resource
> > requirements
> > > > with
> > > > > available/pending resources.
> > > > > * SlotStatusSyncer: communicate with TaskManager, for
> > > allocating/freeing
> > > > > slot and reconciling the slot status
> > > > > Maybe we can start a discussion about refactoring SlotManager in
> > > another
> > > > > FLIP if there are some good suggestions.
> > > > > WDYT
> > > > >
> > > > > 3. For me personally, having a more detailed summary comparing the
> > > > >> subcomponents of both SlotManager implementations with where
> > > > >> their functionality matches and where they differ might help
> > > understand
> > > > the
> > > > >> consequences of the changes proposed in FLIP-298
> > > > >
> > > > > Good suggestion, I have updated the comparison in this FLIP.
> Looking
> > > > > forward to any suggestions/thoughts
> > > > > if they are not described clearly.
> > > > >
> > > > > *@John
> > > > >
> > > > > 4. In addition to changing the default, would it make sense to log
> a
> > > > >> deprecation warning on initialization
> > > > >
> > > > > if the DeclarativeSlotManager is used?
> > > > >>
> > > > > SGTM, We should add Deprecated annotations to DSM for devs. And
> log a
> > > > > deprecation warning for users.
> > > > >
> > > > > *@Shammon
> > > > >
> > > > > 1. For their functional differences, can you give some detailed
> tests
> > > to
> > > > >> verify that the new FineGrainedSlotManager has these capabilities?
> > > This
> > > > can
> > > > >> effectively verify the new functions
> > > > >>
> > > > > As just maintained, there is already a CI stage of FGSM, and I will
> > do
> > > > more
> > > > > review of unit tests for DSM.
> > > > >
> > > > >  2. I'm worried that many functions are not independent and it is
> > > > difficult
> > > > >> to migrate step-by-step. You can list the relationship between
> them
> > in
> > > > >> detail.
> > > > >
> > > > >  As Xintong saied the DSM is a subset of FGSM by design. But as
> time
> > > goes
> > > > > on, FGSM has some lacking
> > > > > functions as I listed in this FLIP. And I have added the comparison
> > > > between
> > > > > DSM and FGSM in this FLIP.
> > > > >
> > > > >
> > > > > Thanks again for all your thoughts. Any feedback is appreciated!
> > > > >
> > > > > Best,
> > > > > Weihua
> > > > >
> > > > >
> > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <tonysong...@gmail.com
> >
> > > > wrote:
> > > > >
> > > > >> Thanks Weihua for preparing this FLIP. +1 for the proposal.
> > > > >>
> > > > >>
> > > > >> As one of the contributors of the fine-grained slot manager, I'd
> > like
> > > to
> > > > >> share some backgrounds here.
> > > > >>
> > > > >> - There used to be a defaut slot manager implementation, which is
> > > > >> non-declarative and has been removed now. The two features,
> > > declarative
> > > > /
> > > > >> reactive resource management and fine-grained resource management,
> > > were
> > > > >> proposed at about the same time. We were aware that by design the
> > > > >> declarative slot manager is a subset of fine-grained slot manager
> at
> > > > that
> > > > >> time, but still decided to implement two slot managers for the
> > purpose
> > > > of
> > > > >> decoupling efforts and reducing cross-team synchronization
> overhead.
> > > > >> Merging the two slot managers once they are proved stable is IMO a
> > > > >> technical debt that was planned at the very beginning.
> > > > >>
> > > > >> - The FineGrainedSlotManager has been verified in Alibaba's
> internal
> > > > >> production as well as Alibaba Cloud services as the default slot
> > > manager
> > > > >> for about 2 years.
> > > > >>
> > > > >>
> > > > >> Concerning test cases, we currently have a ci stage for fine
> grained
> > > > >> resource management. To avoid adding too much burden, the stage
> only
> > > > >> includes tests from flink-runtime and flink-test modules. I think
> > > > switching
> > > > >> the default slot manager and applying the whole set of tests on
> the
> > > > >> fine-grained slot manager would help us to be more confident about
> > it.
> > > > >>
> > > > >>
> > > > >> Concerning cutting out functionalities out of slot manager, I
> think
> > > > Yangze
> > > > >> and I have tried our best to shape the FineGrainedSlotManager into
> > > > >> reasonable components. I personally don't have other ideas to
> > further
> > > > >> disassemble the component, but I'm open to such suggestions.
> > However,
> > > > from
> > > > >> the stability perspective, I'd be in favor of not introducing
> > > > significant
> > > > >> changes to the FineGrainedSlotManager while switching it to the
> > > default.
> > > > >> Because the current implementation has already been verified (or
> at
> > > > least
> > > > >> partially verified because Alibaba does not cover all the Flink
> use
> > > > cases),
> > > > >> and introducing more changes also means more chances of breaking
> > > things.
> > > > >>
> > > > >>
> > > > >> Best,
> > > > >>
> > > > >> Xintong
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zjur...@gmail.com>
> > wrote:
> > > > >>
> > > > >> > Hi
> > > > >> >
> > > > >> > Thanks for starting this work weihua, I think unifying
> > > > >> > DeclarativeSlotManager and FineGrainedSlotManager is valuable.
> > > > >> >
> > > > >> > I agree with @Matthias and @John that we need a way to ensure
> that
> > > > >> > DeclarativeSlotManager's capabilities are fully covered by
> > > > >> > FineGrainedSlotManager
> > > > >> >
> > > > >> > 1. For their functional differences, can you give some detailed
> > > tests
> > > > to
> > > > >> > verify that the new FineGrainedSlotManager has these
> capabilities?
> > > > This
> > > > >> can
> > > > >> > effectively verify the new functions
> > > > >> >
> > > > >> > 2. I'm worried that many functions are not independent and it is
> > > > >> difficult
> > > > >> > to migrate step-by-step. You can list the relationship between
> > them
> > > in
> > > > >> > detail.
> > > > >> >
> > > > >> > 3. As John mentioned, give a smoke test for
> FineGrainedSlotManager
> > > is
> > > > a
> > > > >> > good idea. Or you can add some test information to the
> > > > >> > DeclarativeSlotManager to determine how many tests have used it.
> > In
> > > > this
> > > > >> > way, we can gradually construct test cases for
> > > FineGrainedSlotManager
> > > > >> > during the development process.
> > > > >> >
> > > > >> >
> > > > >> > Best,
> > > > >> > Shammon
> > > > >> >
> > > > >> >
> > > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <
> > vvcep...@apache.org>
> > > > >> wrote:
> > > > >> >
> > > > >> > > Thanks for the FLIP, Weihua!
> > > > >> > >
> > > > >> > > I’ve read the FLIP, and it sounds good to me. We need to avoid
> > > > >> > > proliferating alternative implementations wherever possible. I
> > > have
> > > > >> just
> > > > >> > a
> > > > >> > > couple of comments:
> > > > >> > >
> > > > >> > > 1. I share Matthias’s concern about ensuring the behavior is
> > > really
> > > > the
> > > > >> > > same. One suggestion I’ve used for this kind of thing is, as a
> > > smoke
> > > > >> > test,
> > > > >> > > to update the DeclarativeSlotManager to just delegate to the
> > > > >> > > FineGrainedSlotManager. If the full test suite still passes,
> you
> > > > can be
> > > > >> > > pretty sure the new default is really ok. It would not be a
> good
> > > > idea
> > > > >> to
> > > > >> > > actually keep that in for the release, since it would remove
> the
> > > > option
> > > > >> > to
> > > > >> > > fall back in case of bugs. Either way, we need to make sure
> all
> > > test
> > > > >> > > scenarios are present for the FGSM.
> > > > >> > >
> > > > >> > > 4. In addition to changing the default, would it make sense to
> > > log a
> > > > >> > > deprecation warning on initialization if the
> > > DeclarativeSlotManager
> > > > is
> > > > >> > used?
> > > > >> > >
> > > > >> > > Thanks again,
> > > > >> > > John
> > > > >> > >
> > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
> > > > >> > > > Hi Weihua,
> > > > >> > > > Thanks for your proposal. From a conceptual point: AFAIU,
> the
> > > > >> > > > DeclarativeSlotManager covers a subset (i.e. only evenly
> sized
> > > > slots)
> > > > >> > of
> > > > >> > > > what the FineGrainedSlotManager should be able to achieve
> > > > (variable
> > > > >> > slot
> > > > >> > > > size per task manager). Is this the right
> > > > assumption/understanding?
> > > > >> In
> > > > >> > > this
> > > > >> > > > sense, merging both implementations into a single one sounds
> > > > good. A
> > > > >> > few
> > > > >> > > > more general comments, though:
> > > > >> > > >
> > > > >> > > > 1. Did you do a proper test coverage analysis? That's not
> > > > mentioned
> > > > >> in
> > > > >> > > the
> > > > >> > > > current version of the FLIP. I'm bringing this up because we
> > ran
> > > > into
> > > > >> > the
> > > > >> > > > same issue when fixing the flaws that popped up after
> > > introducing
> > > > the
> > > > >> > > > multi-component leader election (see FLIP-285 [1]). There
> is a
> > > > risk
> > > > >> > that
> > > > >> > > by
> > > > >> > > > removing the legacy code we decrease test coverage because
> > > certain
> > > > >> > > > test cases that were covered for the legacy classes might
> not
> > be
> > > > >> > > > necessarily covered in the new implementation, yet (see
> > > > FLINK-30338
> > > > >> [2]
> > > > >> > > > which covers this issue for the leader election case).
> > Ideally,
> > > we
> > > > >> > don't
> > > > >> > > > want to remove test cases accidentally because they were
> only
> > > > >> > implemented
> > > > >> > > > for the DeclarativeSlotManager but missed for the
> > > > >> > FineGrainedSlotManager.
> > > > >> > > >
> > > > >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel
> > quite
> > > > big
> > > > >> in
> > > > >> > > > terms of lines of code. Without knowing whether it's
> actually
> > a
> > > > >> > > reasonable
> > > > >> > > > thing to do: Instead of just adding more features to the
> > > > >> > > > FineGrainedSlotManager, have you thought of cutting out
> > > > functionality
> > > > >> > > into
> > > > >> > > > smaller sub-components along this refactoring? Such a
> > > step-by-step
> > > > >> > > approach
> > > > >> > > > might improve the overall codebase and might make reviewing
> > the
> > > > >> > > refactoring
> > > > >> > > > easier. I did a first pass over the code and struggled to
> > > identify
> > > > >> code
> > > > >> > > > blocks that could be moved out of the SlotManager
> > > > implementation(s).
> > > > >> > > > Therefore, I might be wrong with this proposal. I haven't
> > worked
> > > > on
> > > > >> > this
> > > > >> > > > codebase in that detail that it would allow me to come up
> > with a
> > > > >> > > judgement
> > > > >> > > > call. I wanted to bring it up, anyway, because I'm curious
> > > whether
> > > > >> that
> > > > >> > > > could be an option. There's a comment created by Chesnay
> > (CC'd)
> > > in
> > > > >> the
> > > > >> > > > JavaDoc of TaskExecutorManager [3] indicating something
> > similar.
> > > > I'm
> > > > >> > > > wondering whether he can add some insights here.
> > > > >> > > >
> > > > >> > > > 3. For me personally, having a more detailed summary
> comparing
> > > the
> > > > >> > > > subcomponents of both SlotManager implementations with where
> > > > >> > > > their functionality matches and where they differ might help
> > > > >> understand
> > > > >> > > the
> > > > >> > > > consequences of the changes proposed in FLIP-298.
> > > > >> > > >
> > > > >> > > > Best,
> > > > >> > > > Matthias
> > > > >> > > >
> > > > >> > > > [1]
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > > > >> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338
> > > > >> > > > [3]
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> > > > >> > > >
> > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wang...@163.com>
> > > > wrote:
> > > > >> > > >
> > > > >> > > >> This is a good proposal for me, it will make the code of
> the
> > > > >> > SlotManager
> > > > >> > > >> more clear.
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > > >> --
> > > > >> > > >>
> > > > >> > > >> Best,
> > > > >> > > >> Matt Wang
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > > >> ---- Replied Message ----
> > > > >> > > >> | From | David Morávek<d...@apache.org> |
> > > > >> > > >> | Date | 02/27/2023 22:45 |
> > > > >> > > >> | To | <dev@flink.apache.org> |
> > > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the
> > Implementation
> > > > of
> > > > >> > > >> SlotManager |
> > > > >> > > >> Hi Weihua, I still need to dig into the details, but the
> > > overall
> > > > >> > > sentiment
> > > > >> > > >> of this change sounds reasonable.
> > > > >> > > >>
> > > > >> > > >> Best,
> > > > >> > > >> D.
> > > > >> > > >>
> > > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> > > > >> > > zhanghao.c...@outlook.com>
> > > > >> > > >> wrote:
> > > > >> > > >>
> > > > >> > > >> Thanks for driving this topic. I think this FLIP could help
> > > > clean up
> > > > >> > the
> > > > >> > > >> codebase to make it easier to maintain. +1 on it.
> > > > >> > > >>
> > > > >> > > >> Best,
> > > > >> > > >> Zhanghao Chen
> > > > >> > > >> ________________________________
> > > > >> > > >> From: Weihua Hu <huweihua....@gmail.com>
> > > > >> > > >> Sent: Monday, February 27, 2023 20:40
> > > > >> > > >> To: dev <dev@flink.apache.org>
> > > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of
> > > > >> > SlotManager
> > > > >> > > >>
> > > > >> > > >> Hi everyone,
> > > > >> > > >>
> > > > >> > > >> I would like to begin a discussion on FLIP-298: Unifying
> the
> > > > >> > > Implementation
> > > > >> > > >> of SlotManager[1]. There are currently two types of
> > SlotManager
> > > > in
> > > > >> > > Flink:
> > > > >> > > >> DeclarativeSlotManager and FineGrainedSlotManager.
> > > > >> > > FineGrainedSlotManager
> > > > >> > > >> should behave as DeclarativeSlotManager if the user does
> not
> > > > >> configure
> > > > >> > > the
> > > > >> > > >> slot request profile.
> > > > >> > > >>
> > > > >> > > >> Therefore, this FLIP aims to unify the implementation of
> > > > SlotManager
> > > > >> > in
> > > > >> > > >> order to reduce maintenance costs.
> > > > >> > > >>
> > > > >> > > >> Looking forward to hearing from you.
> > > > >> > > >>
> > > > >> > > >> [1]
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> > > > >> > > >>
> > > > >> > > >> Best,
> > > > >> > > >> Weihua
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
>

Reply via email to