Thanks for the test plan, Weihua!

Yes, it addresses my concerns. 

Thanks,
John

On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
> Hi, everyone,
> Thanks for your suggestions and ideas.
> Thanks Xintong for sharing the detailed backgrounds of SlotManager.
>
> *@Matthias
>
> 1. Did you do a proper test coverage analysis?
>
>
> Just as Xintong said, we already have a CI stage for fine grained resource
> managers.
> And I will make sure FineGrainedSlotManager as the default SlotManager can
> pass all the tests of CI.
> In addition, I will review all unit tests of DeclarativeSlotManager(DSM) to
> ensure that there are no gaps in the
> coverage provided by the FineGrainedSlotManager.
> I also added the 'Test Plan' part to the FLIP.
> @Matthias @John @Shammon Does this test plan address your concerns?
>
> 2.  DeclarativeSlotManager and FineGrainedSlotManager feel quite big in
>
> terms of lines of code....
>
>
> IMO, the refactoring of SlotManager does not belong to this FLIP since it
> may lead to some unstable risks. For
> FineGrainedSlotManager(FGSM), we already split some reasonable components.
> They are:
> * TaskManagerTracker: Track task managers and their resources.
> * ResourceTracker: track requirements of jobs
> * ResourceAllocationStrategy: Try to fulfill the resource requirements with
> available/pending resources.
> * SlotStatusSyncer: communicate with TaskManager, for allocating/freeing
> slot and reconciling the slot status
> Maybe we can start a discussion about refactoring SlotManager in another
> FLIP if there are some good suggestions.
> WDYT
>
> 3. For me personally, having a more detailed summary comparing the
>> subcomponents of both SlotManager implementations with where
>> their functionality matches and where they differ might help understand the
>> consequences of the changes proposed in FLIP-298
>
> Good suggestion, I have updated the comparison in this FLIP. Looking
> forward to any suggestions/thoughts
> if they are not described clearly.
>
> *@John
>
> 4. In addition to changing the default, would it make sense to log a
>> deprecation warning on initialization
>
> if the DeclarativeSlotManager is used?
>>
> SGTM, We should add Deprecated annotations to DSM for devs. And log a
> deprecation warning for users.
>
> *@Shammon
>
> 1. For their functional differences, can you give some detailed tests to
>> verify that the new FineGrainedSlotManager has these capabilities? This can
>> effectively verify the new functions
>>
> As just maintained, there is already a CI stage of FGSM, and I will do more
> review of unit tests for DSM.
>
>  2. I'm worried that many functions are not independent and it is difficult
>> to migrate step-by-step. You can list the relationship between them in
>> detail.
>
>  As Xintong saied the DSM is a subset of FGSM by design. But as time goes
> on, FGSM has some lacking
> functions as I listed in this FLIP. And I have added the comparison between
> DSM and FGSM in this FLIP.
>
>
> Thanks again for all your thoughts. Any feedback is appreciated!
>
> Best,
> Weihua
>
>
> On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <tonysong...@gmail.com> wrote:
>
>> Thanks Weihua for preparing this FLIP. +1 for the proposal.
>>
>>
>> As one of the contributors of the fine-grained slot manager, I'd like to
>> share some backgrounds here.
>>
>> - There used to be a defaut slot manager implementation, which is
>> non-declarative and has been removed now. The two features, declarative /
>> reactive resource management and fine-grained resource management, were
>> proposed at about the same time. We were aware that by design the
>> declarative slot manager is a subset of fine-grained slot manager at that
>> time, but still decided to implement two slot managers for the purpose of
>> decoupling efforts and reducing cross-team synchronization overhead.
>> Merging the two slot managers once they are proved stable is IMO a
>> technical debt that was planned at the very beginning.
>>
>> - The FineGrainedSlotManager has been verified in Alibaba's internal
>> production as well as Alibaba Cloud services as the default slot manager
>> for about 2 years.
>>
>>
>> Concerning test cases, we currently have a ci stage for fine grained
>> resource management. To avoid adding too much burden, the stage only
>> includes tests from flink-runtime and flink-test modules. I think switching
>> the default slot manager and applying the whole set of tests on the
>> fine-grained slot manager would help us to be more confident about it.
>>
>>
>> Concerning cutting out functionalities out of slot manager, I think Yangze
>> and I have tried our best to shape the FineGrainedSlotManager into
>> reasonable components. I personally don't have other ideas to further
>> disassemble the component, but I'm open to such suggestions. However, from
>> the stability perspective, I'd be in favor of not introducing significant
>> changes to the FineGrainedSlotManager while switching it to the default.
>> Because the current implementation has already been verified (or at least
>> partially verified because Alibaba does not cover all the Flink use cases),
>> and introducing more changes also means more chances of breaking things.
>>
>>
>> Best,
>>
>> Xintong
>>
>>
>>
>> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zjur...@gmail.com> wrote:
>>
>> > Hi
>> >
>> > Thanks for starting this work weihua, I think unifying
>> > DeclarativeSlotManager and FineGrainedSlotManager is valuable.
>> >
>> > I agree with @Matthias and @John that we need a way to ensure that
>> > DeclarativeSlotManager's capabilities are fully covered by
>> > FineGrainedSlotManager
>> >
>> > 1. For their functional differences, can you give some detailed tests to
>> > verify that the new FineGrainedSlotManager has these capabilities? This
>> can
>> > effectively verify the new functions
>> >
>> > 2. I'm worried that many functions are not independent and it is
>> difficult
>> > to migrate step-by-step. You can list the relationship between them in
>> > detail.
>> >
>> > 3. As John mentioned, give a smoke test for FineGrainedSlotManager is a
>> > good idea. Or you can add some test information to the
>> > DeclarativeSlotManager to determine how many tests have used it. In this
>> > way, we can gradually construct test cases for FineGrainedSlotManager
>> > during the development process.
>> >
>> >
>> > Best,
>> > Shammon
>> >
>> >
>> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <vvcep...@apache.org>
>> wrote:
>> >
>> > > Thanks for the FLIP, Weihua!
>> > >
>> > > I’ve read the FLIP, and it sounds good to me. We need to avoid
>> > > proliferating alternative implementations wherever possible. I have
>> just
>> > a
>> > > couple of comments:
>> > >
>> > > 1. I share Matthias’s concern about ensuring the behavior is really the
>> > > same. One suggestion I’ve used for this kind of thing is, as a smoke
>> > test,
>> > > to update the DeclarativeSlotManager to just delegate to the
>> > > FineGrainedSlotManager. If the full test suite still passes, you can be
>> > > pretty sure the new default is really ok. It would not be a good idea
>> to
>> > > actually keep that in for the release, since it would remove the option
>> > to
>> > > fall back in case of bugs. Either way, we need to make sure all test
>> > > scenarios are present for the FGSM.
>> > >
>> > > 4. In addition to changing the default, would it make sense to log a
>> > > deprecation warning on initialization if the DeclarativeSlotManager is
>> > used?
>> > >
>> > > Thanks again,
>> > > John
>> > >
>> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
>> > > > Hi Weihua,
>> > > > Thanks for your proposal. From a conceptual point: AFAIU, the
>> > > > DeclarativeSlotManager covers a subset (i.e. only evenly sized slots)
>> > of
>> > > > what the FineGrainedSlotManager should be able to achieve (variable
>> > slot
>> > > > size per task manager). Is this the right assumption/understanding?
>> In
>> > > this
>> > > > sense, merging both implementations into a single one sounds good. A
>> > few
>> > > > more general comments, though:
>> > > >
>> > > > 1. Did you do a proper test coverage analysis? That's not mentioned
>> in
>> > > the
>> > > > current version of the FLIP. I'm bringing this up because we ran into
>> > the
>> > > > same issue when fixing the flaws that popped up after introducing the
>> > > > multi-component leader election (see FLIP-285 [1]). There is a risk
>> > that
>> > > by
>> > > > removing the legacy code we decrease test coverage because certain
>> > > > test cases that were covered for the legacy classes might not be
>> > > > necessarily covered in the new implementation, yet (see FLINK-30338
>> [2]
>> > > > which covers this issue for the leader election case). Ideally, we
>> > don't
>> > > > want to remove test cases accidentally because they were only
>> > implemented
>> > > > for the DeclarativeSlotManager but missed for the
>> > FineGrainedSlotManager.
>> > > >
>> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite big
>> in
>> > > > terms of lines of code. Without knowing whether it's actually a
>> > > reasonable
>> > > > thing to do: Instead of just adding more features to the
>> > > > FineGrainedSlotManager, have you thought of cutting out functionality
>> > > into
>> > > > smaller sub-components along this refactoring? Such a step-by-step
>> > > approach
>> > > > might improve the overall codebase and might make reviewing the
>> > > refactoring
>> > > > easier. I did a first pass over the code and struggled to identify
>> code
>> > > > blocks that could be moved out of the SlotManager implementation(s).
>> > > > Therefore, I might be wrong with this proposal. I haven't worked on
>> > this
>> > > > codebase in that detail that it would allow me to come up with a
>> > > judgement
>> > > > call. I wanted to bring it up, anyway, because I'm curious whether
>> that
>> > > > could be an option. There's a comment created by Chesnay (CC'd) in
>> the
>> > > > JavaDoc of TaskExecutorManager [3] indicating something similar. I'm
>> > > > wondering whether he can add some insights here.
>> > > >
>> > > > 3. For me personally, having a more detailed summary comparing the
>> > > > subcomponents of both SlotManager implementations with where
>> > > > their functionality matches and where they differ might help
>> understand
>> > > the
>> > > > consequences of the changes proposed in FLIP-298.
>> > > >
>> > > > Best,
>> > > > Matthias
>> > > >
>> > > > [1]
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
>> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338
>> > > > [3]
>> > > >
>> > >
>> >
>> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
>> > > >
>> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wang...@163.com> wrote:
>> > > >
>> > > >> This is a good proposal for me, it will make the code of the
>> > SlotManager
>> > > >> more clear.
>> > > >>
>> > > >>
>> > > >>
>> > > >> --
>> > > >>
>> > > >> Best,
>> > > >> Matt Wang
>> > > >>
>> > > >>
>> > > >> ---- Replied Message ----
>> > > >> | From | David Morávek<d...@apache.org> |
>> > > >> | Date | 02/27/2023 22:45 |
>> > > >> | To | <dev@flink.apache.org> |
>> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation of
>> > > >> SlotManager |
>> > > >> Hi Weihua, I still need to dig into the details, but the overall
>> > > sentiment
>> > > >> of this change sounds reasonable.
>> > > >>
>> > > >> Best,
>> > > >> D.
>> > > >>
>> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
>> > > zhanghao.c...@outlook.com>
>> > > >> wrote:
>> > > >>
>> > > >> Thanks for driving this topic. I think this FLIP could help clean up
>> > the
>> > > >> codebase to make it easier to maintain. +1 on it.
>> > > >>
>> > > >> Best,
>> > > >> Zhanghao Chen
>> > > >> ________________________________
>> > > >> From: Weihua Hu <huweihua....@gmail.com>
>> > > >> Sent: Monday, February 27, 2023 20:40
>> > > >> To: dev <dev@flink.apache.org>
>> > > >> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of
>> > SlotManager
>> > > >>
>> > > >> Hi everyone,
>> > > >>
>> > > >> I would like to begin a discussion on FLIP-298: Unifying the
>> > > Implementation
>> > > >> of SlotManager[1]. There are currently two types of SlotManager in
>> > > Flink:
>> > > >> DeclarativeSlotManager and FineGrainedSlotManager.
>> > > FineGrainedSlotManager
>> > > >> should behave as DeclarativeSlotManager if the user does not
>> configure
>> > > the
>> > > >> slot request profile.
>> > > >>
>> > > >> Therefore, this FLIP aims to unify the implementation of SlotManager
>> > in
>> > > >> order to reduce maintenance costs.
>> > > >>
>> > > >> Looking forward to hearing from you.
>> > > >>
>> > > >> [1]
>> > > >>
>> > > >>
>> > > >>
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
>> > > >>
>> > > >> Best,
>> > > >> Weihua
>> > > >>
>> > > >>
>> > >
>> >
>>

Reply via email to