Hi,

@David @Matthias
There are a few days after hearing your thoughts. I would like to know if
there are any other concerns about this FLIP.


Best,
Weihua


On Mon, Mar 6, 2023 at 7:53 PM Weihua Hu <huweihua....@gmail.com> wrote:

>
> Thanks Shammon,
>
> I've updated FLIP to add this redundant Task Manager limitation.
>
>
> Best,
> Weihua
>
>
> On Mon, Mar 6, 2023 at 5:07 PM Shammon FY <zjur...@gmail.com> wrote:
>
>> Hi weihua
>>
>> Can you add content related to `heterogeneous resources` to this FLIP? We
>> can record it and consider it in the future. It may be useful for some
>> scenarios, such as the combination of streaming and ML.
>>
>> Best,
>> Shammon
>>
>>
>> On Mon, Mar 6, 2023 at 1:39 PM weijie guo <guoweijieres...@gmail.com>
>> wrote:
>>
>> > Hi Weihua,
>> >
>> > Thanks for your clarification, SGTM.
>> >
>> > Best regards,
>> >
>> > Weijie
>> >
>> >
>> > Weihua Hu <huweihua....@gmail.com> 于2023年3月6日周一 11:43写道:
>> >
>> > > Thanks Weijie.
>> > >
>> > > Heterogeneous task managers will not be considered in this FLIP since
>> > > it does not request heterogeneous resources as you said.
>> > >
>> > > My first thought is we can adjust the meaning of redundant
>> configuration
>> > > to redundant number of per resource type. These can be considered in
>> > > detail when we decide to support heterogeneous task managers.
>> > >
>> > > Best,
>> > > Weihua
>> > >
>> > >
>> > > On Sat, Mar 4, 2023 at 1:13 AM weijie guo <guoweijieres...@gmail.com>
>> > > wrote:
>> > >
>> > > > Thanks Weihua for preparing this FLIP.
>> > > >
>> > > > This FLIP overall looks reasonable to me after updating as
>> suggested by
>> > > > Matthias.
>> > > >
>> > > > I only have one small question about keeping some redundant task
>> > > managers:
>> > > > In the fine-grained resource management, theoretically, it can
>> support
>> > > > heterogeneous taskmanagers. When we complete the missing features
>> for
>> > > FGSM,
>> > > > do we plan to take this into account?
>> > > > Of course, if I remember correctly, FGSM will not request
>> heterogeneous
>> > > > resources at present, so it is also acceptable to me if there is no
>> > > special
>> > > > treatment now.
>> > > >
>> > > > +1 for this changes if we can ensure the test coverage.
>> > > >
>> > > > Best regards,
>> > > >
>> > > > Weijie
>> > > >
>> > > >
>> > > > John Roesler <vvcep...@apache.org> 于2023年3月2日周四 12:53写道:
>> > > >
>> > > > > Thanks for the test plan, Weihua!
>> > > > >
>> > > > > Yes, it addresses my concerns.
>> > > > >
>> > > > > Thanks,
>> > > > > John
>> > > > >
>> > > > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
>> > > > > > Hi, everyone,
>> > > > > > Thanks for your suggestions and ideas.
>> > > > > > Thanks Xintong for sharing the detailed backgrounds of
>> SlotManager.
>> > > > > >
>> > > > > > *@Matthias
>> > > > > >
>> > > > > > 1. Did you do a proper test coverage analysis?
>> > > > > >
>> > > > > >
>> > > > > > Just as Xintong said, we already have a CI stage for fine
>> grained
>> > > > > resource
>> > > > > > managers.
>> > > > > > And I will make sure FineGrainedSlotManager as the default
>> > > SlotManager
>> > > > > can
>> > > > > > pass all the tests of CI.
>> > > > > > In addition, I will review all unit tests of
>> > > > DeclarativeSlotManager(DSM)
>> > > > > to
>> > > > > > ensure that there are no gaps in the
>> > > > > > coverage provided by the FineGrainedSlotManager.
>> > > > > > I also added the 'Test Plan' part to the FLIP.
>> > > > > > @Matthias @John @Shammon Does this test plan address your
>> concerns?
>> > > > > >
>> > > > > > 2.  DeclarativeSlotManager and FineGrainedSlotManager feel quite
>> > big
>> > > in
>> > > > > >
>> > > > > > terms of lines of code....
>> > > > > >
>> > > > > >
>> > > > > > IMO, the refactoring of SlotManager does not belong to this FLIP
>> > > since
>> > > > it
>> > > > > > may lead to some unstable risks. For
>> > > > > > FineGrainedSlotManager(FGSM), we already split some reasonable
>> > > > > components.
>> > > > > > They are:
>> > > > > > * TaskManagerTracker: Track task managers and their resources.
>> > > > > > * ResourceTracker: track requirements of jobs
>> > > > > > * ResourceAllocationStrategy: Try to fulfill the resource
>> > > requirements
>> > > > > with
>> > > > > > available/pending resources.
>> > > > > > * SlotStatusSyncer: communicate with TaskManager, for
>> > > > allocating/freeing
>> > > > > > slot and reconciling the slot status
>> > > > > > Maybe we can start a discussion about refactoring SlotManager in
>> > > > another
>> > > > > > FLIP if there are some good suggestions.
>> > > > > > WDYT
>> > > > > >
>> > > > > > 3. For me personally, having a more detailed summary comparing
>> the
>> > > > > >> subcomponents of both SlotManager implementations with where
>> > > > > >> their functionality matches and where they differ might help
>> > > > understand
>> > > > > the
>> > > > > >> consequences of the changes proposed in FLIP-298
>> > > > > >
>> > > > > > Good suggestion, I have updated the comparison in this FLIP.
>> > Looking
>> > > > > > forward to any suggestions/thoughts
>> > > > > > if they are not described clearly.
>> > > > > >
>> > > > > > *@John
>> > > > > >
>> > > > > > 4. In addition to changing the default, would it make sense to
>> log
>> > a
>> > > > > >> deprecation warning on initialization
>> > > > > >
>> > > > > > if the DeclarativeSlotManager is used?
>> > > > > >>
>> > > > > > SGTM, We should add Deprecated annotations to DSM for devs. And
>> > log a
>> > > > > > deprecation warning for users.
>> > > > > >
>> > > > > > *@Shammon
>> > > > > >
>> > > > > > 1. For their functional differences, can you give some detailed
>> > tests
>> > > > to
>> > > > > >> verify that the new FineGrainedSlotManager has these
>> capabilities?
>> > > > This
>> > > > > can
>> > > > > >> effectively verify the new functions
>> > > > > >>
>> > > > > > As just maintained, there is already a CI stage of FGSM, and I
>> will
>> > > do
>> > > > > more
>> > > > > > review of unit tests for DSM.
>> > > > > >
>> > > > > >  2. I'm worried that many functions are not independent and it
>> is
>> > > > > difficult
>> > > > > >> to migrate step-by-step. You can list the relationship between
>> > them
>> > > in
>> > > > > >> detail.
>> > > > > >
>> > > > > >  As Xintong saied the DSM is a subset of FGSM by design. But as
>> > time
>> > > > goes
>> > > > > > on, FGSM has some lacking
>> > > > > > functions as I listed in this FLIP. And I have added the
>> comparison
>> > > > > between
>> > > > > > DSM and FGSM in this FLIP.
>> > > > > >
>> > > > > >
>> > > > > > Thanks again for all your thoughts. Any feedback is appreciated!
>> > > > > >
>> > > > > > Best,
>> > > > > > Weihua
>> > > > > >
>> > > > > >
>> > > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <
>> tonysong...@gmail.com
>> > >
>> > > > > wrote:
>> > > > > >
>> > > > > >> Thanks Weihua for preparing this FLIP. +1 for the proposal.
>> > > > > >>
>> > > > > >>
>> > > > > >> As one of the contributors of the fine-grained slot manager,
>> I'd
>> > > like
>> > > > to
>> > > > > >> share some backgrounds here.
>> > > > > >>
>> > > > > >> - There used to be a defaut slot manager implementation, which
>> is
>> > > > > >> non-declarative and has been removed now. The two features,
>> > > > declarative
>> > > > > /
>> > > > > >> reactive resource management and fine-grained resource
>> management,
>> > > > were
>> > > > > >> proposed at about the same time. We were aware that by design
>> the
>> > > > > >> declarative slot manager is a subset of fine-grained slot
>> manager
>> > at
>> > > > > that
>> > > > > >> time, but still decided to implement two slot managers for the
>> > > purpose
>> > > > > of
>> > > > > >> decoupling efforts and reducing cross-team synchronization
>> > overhead.
>> > > > > >> Merging the two slot managers once they are proved stable is
>> IMO a
>> > > > > >> technical debt that was planned at the very beginning.
>> > > > > >>
>> > > > > >> - The FineGrainedSlotManager has been verified in Alibaba's
>> > internal
>> > > > > >> production as well as Alibaba Cloud services as the default
>> slot
>> > > > manager
>> > > > > >> for about 2 years.
>> > > > > >>
>> > > > > >>
>> > > > > >> Concerning test cases, we currently have a ci stage for fine
>> > grained
>> > > > > >> resource management. To avoid adding too much burden, the stage
>> > only
>> > > > > >> includes tests from flink-runtime and flink-test modules. I
>> think
>> > > > > switching
>> > > > > >> the default slot manager and applying the whole set of tests on
>> > the
>> > > > > >> fine-grained slot manager would help us to be more confident
>> about
>> > > it.
>> > > > > >>
>> > > > > >>
>> > > > > >> Concerning cutting out functionalities out of slot manager, I
>> > think
>> > > > > Yangze
>> > > > > >> and I have tried our best to shape the FineGrainedSlotManager
>> into
>> > > > > >> reasonable components. I personally don't have other ideas to
>> > > further
>> > > > > >> disassemble the component, but I'm open to such suggestions.
>> > > However,
>> > > > > from
>> > > > > >> the stability perspective, I'd be in favor of not introducing
>> > > > > significant
>> > > > > >> changes to the FineGrainedSlotManager while switching it to the
>> > > > default.
>> > > > > >> Because the current implementation has already been verified
>> (or
>> > at
>> > > > > least
>> > > > > >> partially verified because Alibaba does not cover all the Flink
>> > use
>> > > > > cases),
>> > > > > >> and introducing more changes also means more chances of
>> breaking
>> > > > things.
>> > > > > >>
>> > > > > >>
>> > > > > >> Best,
>> > > > > >>
>> > > > > >> Xintong
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zjur...@gmail.com>
>> > > wrote:
>> > > > > >>
>> > > > > >> > Hi
>> > > > > >> >
>> > > > > >> > Thanks for starting this work weihua, I think unifying
>> > > > > >> > DeclarativeSlotManager and FineGrainedSlotManager is
>> valuable.
>> > > > > >> >
>> > > > > >> > I agree with @Matthias and @John that we need a way to ensure
>> > that
>> > > > > >> > DeclarativeSlotManager's capabilities are fully covered by
>> > > > > >> > FineGrainedSlotManager
>> > > > > >> >
>> > > > > >> > 1. For their functional differences, can you give some
>> detailed
>> > > > tests
>> > > > > to
>> > > > > >> > verify that the new FineGrainedSlotManager has these
>> > capabilities?
>> > > > > This
>> > > > > >> can
>> > > > > >> > effectively verify the new functions
>> > > > > >> >
>> > > > > >> > 2. I'm worried that many functions are not independent and
>> it is
>> > > > > >> difficult
>> > > > > >> > to migrate step-by-step. You can list the relationship
>> between
>> > > them
>> > > > in
>> > > > > >> > detail.
>> > > > > >> >
>> > > > > >> > 3. As John mentioned, give a smoke test for
>> > FineGrainedSlotManager
>> > > > is
>> > > > > a
>> > > > > >> > good idea. Or you can add some test information to the
>> > > > > >> > DeclarativeSlotManager to determine how many tests have used
>> it.
>> > > In
>> > > > > this
>> > > > > >> > way, we can gradually construct test cases for
>> > > > FineGrainedSlotManager
>> > > > > >> > during the development process.
>> > > > > >> >
>> > > > > >> >
>> > > > > >> > Best,
>> > > > > >> > Shammon
>> > > > > >> >
>> > > > > >> >
>> > > > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <
>> > > vvcep...@apache.org>
>> > > > > >> wrote:
>> > > > > >> >
>> > > > > >> > > Thanks for the FLIP, Weihua!
>> > > > > >> > >
>> > > > > >> > > I’ve read the FLIP, and it sounds good to me. We need to
>> avoid
>> > > > > >> > > proliferating alternative implementations wherever
>> possible. I
>> > > > have
>> > > > > >> just
>> > > > > >> > a
>> > > > > >> > > couple of comments:
>> > > > > >> > >
>> > > > > >> > > 1. I share Matthias’s concern about ensuring the behavior
>> is
>> > > > really
>> > > > > the
>> > > > > >> > > same. One suggestion I’ve used for this kind of thing is,
>> as a
>> > > > smoke
>> > > > > >> > test,
>> > > > > >> > > to update the DeclarativeSlotManager to just delegate to
>> the
>> > > > > >> > > FineGrainedSlotManager. If the full test suite still
>> passes,
>> > you
>> > > > > can be
>> > > > > >> > > pretty sure the new default is really ok. It would not be a
>> > good
>> > > > > idea
>> > > > > >> to
>> > > > > >> > > actually keep that in for the release, since it would
>> remove
>> > the
>> > > > > option
>> > > > > >> > to
>> > > > > >> > > fall back in case of bugs. Either way, we need to make sure
>> > all
>> > > > test
>> > > > > >> > > scenarios are present for the FGSM.
>> > > > > >> > >
>> > > > > >> > > 4. In addition to changing the default, would it make
>> sense to
>> > > > log a
>> > > > > >> > > deprecation warning on initialization if the
>> > > > DeclarativeSlotManager
>> > > > > is
>> > > > > >> > used?
>> > > > > >> > >
>> > > > > >> > > Thanks again,
>> > > > > >> > > John
>> > > > > >> > >
>> > > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
>> > > > > >> > > > Hi Weihua,
>> > > > > >> > > > Thanks for your proposal. From a conceptual point: AFAIU,
>> > the
>> > > > > >> > > > DeclarativeSlotManager covers a subset (i.e. only evenly
>> > sized
>> > > > > slots)
>> > > > > >> > of
>> > > > > >> > > > what the FineGrainedSlotManager should be able to achieve
>> > > > > (variable
>> > > > > >> > slot
>> > > > > >> > > > size per task manager). Is this the right
>> > > > > assumption/understanding?
>> > > > > >> In
>> > > > > >> > > this
>> > > > > >> > > > sense, merging both implementations into a single one
>> sounds
>> > > > > good. A
>> > > > > >> > few
>> > > > > >> > > > more general comments, though:
>> > > > > >> > > >
>> > > > > >> > > > 1. Did you do a proper test coverage analysis? That's not
>> > > > > mentioned
>> > > > > >> in
>> > > > > >> > > the
>> > > > > >> > > > current version of the FLIP. I'm bringing this up
>> because we
>> > > ran
>> > > > > into
>> > > > > >> > the
>> > > > > >> > > > same issue when fixing the flaws that popped up after
>> > > > introducing
>> > > > > the
>> > > > > >> > > > multi-component leader election (see FLIP-285 [1]). There
>> > is a
>> > > > > risk
>> > > > > >> > that
>> > > > > >> > > by
>> > > > > >> > > > removing the legacy code we decrease test coverage
>> because
>> > > > certain
>> > > > > >> > > > test cases that were covered for the legacy classes might
>> > not
>> > > be
>> > > > > >> > > > necessarily covered in the new implementation, yet (see
>> > > > > FLINK-30338
>> > > > > >> [2]
>> > > > > >> > > > which covers this issue for the leader election case).
>> > > Ideally,
>> > > > we
>> > > > > >> > don't
>> > > > > >> > > > want to remove test cases accidentally because they were
>> > only
>> > > > > >> > implemented
>> > > > > >> > > > for the DeclarativeSlotManager but missed for the
>> > > > > >> > FineGrainedSlotManager.
>> > > > > >> > > >
>> > > > > >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel
>> > > quite
>> > > > > big
>> > > > > >> in
>> > > > > >> > > > terms of lines of code. Without knowing whether it's
>> > actually
>> > > a
>> > > > > >> > > reasonable
>> > > > > >> > > > thing to do: Instead of just adding more features to the
>> > > > > >> > > > FineGrainedSlotManager, have you thought of cutting out
>> > > > > functionality
>> > > > > >> > > into
>> > > > > >> > > > smaller sub-components along this refactoring? Such a
>> > > > step-by-step
>> > > > > >> > > approach
>> > > > > >> > > > might improve the overall codebase and might make
>> reviewing
>> > > the
>> > > > > >> > > refactoring
>> > > > > >> > > > easier. I did a first pass over the code and struggled to
>> > > > identify
>> > > > > >> code
>> > > > > >> > > > blocks that could be moved out of the SlotManager
>> > > > > implementation(s).
>> > > > > >> > > > Therefore, I might be wrong with this proposal. I haven't
>> > > worked
>> > > > > on
>> > > > > >> > this
>> > > > > >> > > > codebase in that detail that it would allow me to come up
>> > > with a
>> > > > > >> > > judgement
>> > > > > >> > > > call. I wanted to bring it up, anyway, because I'm
>> curious
>> > > > whether
>> > > > > >> that
>> > > > > >> > > > could be an option. There's a comment created by Chesnay
>> > > (CC'd)
>> > > > in
>> > > > > >> the
>> > > > > >> > > > JavaDoc of TaskExecutorManager [3] indicating something
>> > > similar.
>> > > > > I'm
>> > > > > >> > > > wondering whether he can add some insights here.
>> > > > > >> > > >
>> > > > > >> > > > 3. For me personally, having a more detailed summary
>> > comparing
>> > > > the
>> > > > > >> > > > subcomponents of both SlotManager implementations with
>> where
>> > > > > >> > > > their functionality matches and where they differ might
>> help
>> > > > > >> understand
>> > > > > >> > > the
>> > > > > >> > > > consequences of the changes proposed in FLIP-298.
>> > > > > >> > > >
>> > > > > >> > > > Best,
>> > > > > >> > > > Matthias
>> > > > > >> > > >
>> > > > > >> > > > [1]
>> > > > > >> > > >
>> > > > > >> > >
>> > > > > >> >
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
>> > > > > >> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338
>> > > > > >> > > > [3]
>> > > > > >> > > >
>> > > > > >> > >
>> > > > > >> >
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
>> > > > > >> > > >
>> > > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <
>> wang...@163.com>
>> > > > > wrote:
>> > > > > >> > > >
>> > > > > >> > > >> This is a good proposal for me, it will make the code of
>> > the
>> > > > > >> > SlotManager
>> > > > > >> > > >> more clear.
>> > > > > >> > > >>
>> > > > > >> > > >>
>> > > > > >> > > >>
>> > > > > >> > > >> --
>> > > > > >> > > >>
>> > > > > >> > > >> Best,
>> > > > > >> > > >> Matt Wang
>> > > > > >> > > >>
>> > > > > >> > > >>
>> > > > > >> > > >> ---- Replied Message ----
>> > > > > >> > > >> | From | David Morávek<d...@apache.org> |
>> > > > > >> > > >> | Date | 02/27/2023 22:45 |
>> > > > > >> > > >> | To | <dev@flink.apache.org> |
>> > > > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the
>> > > Implementation
>> > > > > of
>> > > > > >> > > >> SlotManager |
>> > > > > >> > > >> Hi Weihua, I still need to dig into the details, but the
>> > > > overall
>> > > > > >> > > sentiment
>> > > > > >> > > >> of this change sounds reasonable.
>> > > > > >> > > >>
>> > > > > >> > > >> Best,
>> > > > > >> > > >> D.
>> > > > > >> > > >>
>> > > > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
>> > > > > >> > > zhanghao.c...@outlook.com>
>> > > > > >> > > >> wrote:
>> > > > > >> > > >>
>> > > > > >> > > >> Thanks for driving this topic. I think this FLIP could
>> help
>> > > > > clean up
>> > > > > >> > the
>> > > > > >> > > >> codebase to make it easier to maintain. +1 on it.
>> > > > > >> > > >>
>> > > > > >> > > >> Best,
>> > > > > >> > > >> Zhanghao Chen
>> > > > > >> > > >> ________________________________
>> > > > > >> > > >> From: Weihua Hu <huweihua....@gmail.com>
>> > > > > >> > > >> Sent: Monday, February 27, 2023 20:40
>> > > > > >> > > >> To: dev <dev@flink.apache.org>
>> > > > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the
>> Implementation of
>> > > > > >> > SlotManager
>> > > > > >> > > >>
>> > > > > >> > > >> Hi everyone,
>> > > > > >> > > >>
>> > > > > >> > > >> I would like to begin a discussion on FLIP-298: Unifying
>> > the
>> > > > > >> > > Implementation
>> > > > > >> > > >> of SlotManager[1]. There are currently two types of
>> > > SlotManager
>> > > > > in
>> > > > > >> > > Flink:
>> > > > > >> > > >> DeclarativeSlotManager and FineGrainedSlotManager.
>> > > > > >> > > FineGrainedSlotManager
>> > > > > >> > > >> should behave as DeclarativeSlotManager if the user does
>> > not
>> > > > > >> configure
>> > > > > >> > > the
>> > > > > >> > > >> slot request profile.
>> > > > > >> > > >>
>> > > > > >> > > >> Therefore, this FLIP aims to unify the implementation of
>> > > > > SlotManager
>> > > > > >> > in
>> > > > > >> > > >> order to reduce maintenance costs.
>> > > > > >> > > >>
>> > > > > >> > > >> Looking forward to hearing from you.
>> > > > > >> > > >>
>> > > > > >> > > >> [1]
>> > > > > >> > > >>
>> > > > > >> > > >>
>> > > > > >> > > >>
>> > > > > >> > >
>> > > > > >> >
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
>> > > > > >> > > >>
>> > > > > >> > > >> Best,
>> > > > > >> > > >> Weihua
>> > > > > >> > > >>
>> > > > > >> > > >>
>> > > > > >> > >
>> > > > > >> >
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>
>

Reply via email to