Hi Weihua, Thanks for your clarification, SGTM.
Best regards, Weijie Weihua Hu <huweihua....@gmail.com> 于2023年3月6日周一 11:43写道: > Thanks Weijie. > > Heterogeneous task managers will not be considered in this FLIP since > it does not request heterogeneous resources as you said. > > My first thought is we can adjust the meaning of redundant configuration > to redundant number of per resource type. These can be considered in > detail when we decide to support heterogeneous task managers. > > Best, > Weihua > > > On Sat, Mar 4, 2023 at 1:13 AM weijie guo <guoweijieres...@gmail.com> > wrote: > > > Thanks Weihua for preparing this FLIP. > > > > This FLIP overall looks reasonable to me after updating as suggested by > > Matthias. > > > > I only have one small question about keeping some redundant task > managers: > > In the fine-grained resource management, theoretically, it can support > > heterogeneous taskmanagers. When we complete the missing features for > FGSM, > > do we plan to take this into account? > > Of course, if I remember correctly, FGSM will not request heterogeneous > > resources at present, so it is also acceptable to me if there is no > special > > treatment now. > > > > +1 for this changes if we can ensure the test coverage. > > > > Best regards, > > > > Weijie > > > > > > John Roesler <vvcep...@apache.org> 于2023年3月2日周四 12:53写道: > > > > > Thanks for the test plan, Weihua! > > > > > > Yes, it addresses my concerns. > > > > > > Thanks, > > > John > > > > > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote: > > > > Hi, everyone, > > > > Thanks for your suggestions and ideas. > > > > Thanks Xintong for sharing the detailed backgrounds of SlotManager. > > > > > > > > *@Matthias > > > > > > > > 1. Did you do a proper test coverage analysis? > > > > > > > > > > > > Just as Xintong said, we already have a CI stage for fine grained > > > resource > > > > managers. > > > > And I will make sure FineGrainedSlotManager as the default > SlotManager > > > can > > > > pass all the tests of CI. > > > > In addition, I will review all unit tests of > > DeclarativeSlotManager(DSM) > > > to > > > > ensure that there are no gaps in the > > > > coverage provided by the FineGrainedSlotManager. > > > > I also added the 'Test Plan' part to the FLIP. > > > > @Matthias @John @Shammon Does this test plan address your concerns? > > > > > > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite big > in > > > > > > > > terms of lines of code.... > > > > > > > > > > > > IMO, the refactoring of SlotManager does not belong to this FLIP > since > > it > > > > may lead to some unstable risks. For > > > > FineGrainedSlotManager(FGSM), we already split some reasonable > > > components. > > > > They are: > > > > * TaskManagerTracker: Track task managers and their resources. > > > > * ResourceTracker: track requirements of jobs > > > > * ResourceAllocationStrategy: Try to fulfill the resource > requirements > > > with > > > > available/pending resources. > > > > * SlotStatusSyncer: communicate with TaskManager, for > > allocating/freeing > > > > slot and reconciling the slot status > > > > Maybe we can start a discussion about refactoring SlotManager in > > another > > > > FLIP if there are some good suggestions. > > > > WDYT > > > > > > > > 3. For me personally, having a more detailed summary comparing the > > > >> subcomponents of both SlotManager implementations with where > > > >> their functionality matches and where they differ might help > > understand > > > the > > > >> consequences of the changes proposed in FLIP-298 > > > > > > > > Good suggestion, I have updated the comparison in this FLIP. Looking > > > > forward to any suggestions/thoughts > > > > if they are not described clearly. > > > > > > > > *@John > > > > > > > > 4. In addition to changing the default, would it make sense to log a > > > >> deprecation warning on initialization > > > > > > > > if the DeclarativeSlotManager is used? > > > >> > > > > SGTM, We should add Deprecated annotations to DSM for devs. And log a > > > > deprecation warning for users. > > > > > > > > *@Shammon > > > > > > > > 1. For their functional differences, can you give some detailed tests > > to > > > >> verify that the new FineGrainedSlotManager has these capabilities? > > This > > > can > > > >> effectively verify the new functions > > > >> > > > > As just maintained, there is already a CI stage of FGSM, and I will > do > > > more > > > > review of unit tests for DSM. > > > > > > > > 2. I'm worried that many functions are not independent and it is > > > difficult > > > >> to migrate step-by-step. You can list the relationship between them > in > > > >> detail. > > > > > > > > As Xintong saied the DSM is a subset of FGSM by design. But as time > > goes > > > > on, FGSM has some lacking > > > > functions as I listed in this FLIP. And I have added the comparison > > > between > > > > DSM and FGSM in this FLIP. > > > > > > > > > > > > Thanks again for all your thoughts. Any feedback is appreciated! > > > > > > > > Best, > > > > Weihua > > > > > > > > > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <tonysong...@gmail.com> > > > wrote: > > > > > > > >> Thanks Weihua for preparing this FLIP. +1 for the proposal. > > > >> > > > >> > > > >> As one of the contributors of the fine-grained slot manager, I'd > like > > to > > > >> share some backgrounds here. > > > >> > > > >> - There used to be a defaut slot manager implementation, which is > > > >> non-declarative and has been removed now. The two features, > > declarative > > > / > > > >> reactive resource management and fine-grained resource management, > > were > > > >> proposed at about the same time. We were aware that by design the > > > >> declarative slot manager is a subset of fine-grained slot manager at > > > that > > > >> time, but still decided to implement two slot managers for the > purpose > > > of > > > >> decoupling efforts and reducing cross-team synchronization overhead. > > > >> Merging the two slot managers once they are proved stable is IMO a > > > >> technical debt that was planned at the very beginning. > > > >> > > > >> - The FineGrainedSlotManager has been verified in Alibaba's internal > > > >> production as well as Alibaba Cloud services as the default slot > > manager > > > >> for about 2 years. > > > >> > > > >> > > > >> Concerning test cases, we currently have a ci stage for fine grained > > > >> resource management. To avoid adding too much burden, the stage only > > > >> includes tests from flink-runtime and flink-test modules. I think > > > switching > > > >> the default slot manager and applying the whole set of tests on the > > > >> fine-grained slot manager would help us to be more confident about > it. > > > >> > > > >> > > > >> Concerning cutting out functionalities out of slot manager, I think > > > Yangze > > > >> and I have tried our best to shape the FineGrainedSlotManager into > > > >> reasonable components. I personally don't have other ideas to > further > > > >> disassemble the component, but I'm open to such suggestions. > However, > > > from > > > >> the stability perspective, I'd be in favor of not introducing > > > significant > > > >> changes to the FineGrainedSlotManager while switching it to the > > default. > > > >> Because the current implementation has already been verified (or at > > > least > > > >> partially verified because Alibaba does not cover all the Flink use > > > cases), > > > >> and introducing more changes also means more chances of breaking > > things. > > > >> > > > >> > > > >> Best, > > > >> > > > >> Xintong > > > >> > > > >> > > > >> > > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zjur...@gmail.com> > wrote: > > > >> > > > >> > Hi > > > >> > > > > >> > Thanks for starting this work weihua, I think unifying > > > >> > DeclarativeSlotManager and FineGrainedSlotManager is valuable. > > > >> > > > > >> > I agree with @Matthias and @John that we need a way to ensure that > > > >> > DeclarativeSlotManager's capabilities are fully covered by > > > >> > FineGrainedSlotManager > > > >> > > > > >> > 1. For their functional differences, can you give some detailed > > tests > > > to > > > >> > verify that the new FineGrainedSlotManager has these capabilities? > > > This > > > >> can > > > >> > effectively verify the new functions > > > >> > > > > >> > 2. I'm worried that many functions are not independent and it is > > > >> difficult > > > >> > to migrate step-by-step. You can list the relationship between > them > > in > > > >> > detail. > > > >> > > > > >> > 3. As John mentioned, give a smoke test for FineGrainedSlotManager > > is > > > a > > > >> > good idea. Or you can add some test information to the > > > >> > DeclarativeSlotManager to determine how many tests have used it. > In > > > this > > > >> > way, we can gradually construct test cases for > > FineGrainedSlotManager > > > >> > during the development process. > > > >> > > > > >> > > > > >> > Best, > > > >> > Shammon > > > >> > > > > >> > > > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler < > vvcep...@apache.org> > > > >> wrote: > > > >> > > > > >> > > Thanks for the FLIP, Weihua! > > > >> > > > > > >> > > I’ve read the FLIP, and it sounds good to me. We need to avoid > > > >> > > proliferating alternative implementations wherever possible. I > > have > > > >> just > > > >> > a > > > >> > > couple of comments: > > > >> > > > > > >> > > 1. I share Matthias’s concern about ensuring the behavior is > > really > > > the > > > >> > > same. One suggestion I’ve used for this kind of thing is, as a > > smoke > > > >> > test, > > > >> > > to update the DeclarativeSlotManager to just delegate to the > > > >> > > FineGrainedSlotManager. If the full test suite still passes, you > > > can be > > > >> > > pretty sure the new default is really ok. It would not be a good > > > idea > > > >> to > > > >> > > actually keep that in for the release, since it would remove the > > > option > > > >> > to > > > >> > > fall back in case of bugs. Either way, we need to make sure all > > test > > > >> > > scenarios are present for the FGSM. > > > >> > > > > > >> > > 4. In addition to changing the default, would it make sense to > > log a > > > >> > > deprecation warning on initialization if the > > DeclarativeSlotManager > > > is > > > >> > used? > > > >> > > > > > >> > > Thanks again, > > > >> > > John > > > >> > > > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote: > > > >> > > > Hi Weihua, > > > >> > > > Thanks for your proposal. From a conceptual point: AFAIU, the > > > >> > > > DeclarativeSlotManager covers a subset (i.e. only evenly sized > > > slots) > > > >> > of > > > >> > > > what the FineGrainedSlotManager should be able to achieve > > > (variable > > > >> > slot > > > >> > > > size per task manager). Is this the right > > > assumption/understanding? > > > >> In > > > >> > > this > > > >> > > > sense, merging both implementations into a single one sounds > > > good. A > > > >> > few > > > >> > > > more general comments, though: > > > >> > > > > > > >> > > > 1. Did you do a proper test coverage analysis? That's not > > > mentioned > > > >> in > > > >> > > the > > > >> > > > current version of the FLIP. I'm bringing this up because we > ran > > > into > > > >> > the > > > >> > > > same issue when fixing the flaws that popped up after > > introducing > > > the > > > >> > > > multi-component leader election (see FLIP-285 [1]). There is a > > > risk > > > >> > that > > > >> > > by > > > >> > > > removing the legacy code we decrease test coverage because > > certain > > > >> > > > test cases that were covered for the legacy classes might not > be > > > >> > > > necessarily covered in the new implementation, yet (see > > > FLINK-30338 > > > >> [2] > > > >> > > > which covers this issue for the leader election case). > Ideally, > > we > > > >> > don't > > > >> > > > want to remove test cases accidentally because they were only > > > >> > implemented > > > >> > > > for the DeclarativeSlotManager but missed for the > > > >> > FineGrainedSlotManager. > > > >> > > > > > > >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel > quite > > > big > > > >> in > > > >> > > > terms of lines of code. Without knowing whether it's actually > a > > > >> > > reasonable > > > >> > > > thing to do: Instead of just adding more features to the > > > >> > > > FineGrainedSlotManager, have you thought of cutting out > > > functionality > > > >> > > into > > > >> > > > smaller sub-components along this refactoring? Such a > > step-by-step > > > >> > > approach > > > >> > > > might improve the overall codebase and might make reviewing > the > > > >> > > refactoring > > > >> > > > easier. I did a first pass over the code and struggled to > > identify > > > >> code > > > >> > > > blocks that could be moved out of the SlotManager > > > implementation(s). > > > >> > > > Therefore, I might be wrong with this proposal. I haven't > worked > > > on > > > >> > this > > > >> > > > codebase in that detail that it would allow me to come up > with a > > > >> > > judgement > > > >> > > > call. I wanted to bring it up, anyway, because I'm curious > > whether > > > >> that > > > >> > > > could be an option. There's a comment created by Chesnay > (CC'd) > > in > > > >> the > > > >> > > > JavaDoc of TaskExecutorManager [3] indicating something > similar. > > > I'm > > > >> > > > wondering whether he can add some insights here. > > > >> > > > > > > >> > > > 3. For me personally, having a more detailed summary comparing > > the > > > >> > > > subcomponents of both SlotManager implementations with where > > > >> > > > their functionality matches and where they differ might help > > > >> understand > > > >> > > the > > > >> > > > consequences of the changes proposed in FLIP-298. > > > >> > > > > > > >> > > > Best, > > > >> > > > Matthias > > > >> > > > > > > >> > > > [1] > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box > > > >> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338 > > > >> > > > [3] > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68 > > > >> > > > > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wang...@163.com> > > > wrote: > > > >> > > > > > > >> > > >> This is a good proposal for me, it will make the code of the > > > >> > SlotManager > > > >> > > >> more clear. > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> -- > > > >> > > >> > > > >> > > >> Best, > > > >> > > >> Matt Wang > > > >> > > >> > > > >> > > >> > > > >> > > >> ---- Replied Message ---- > > > >> > > >> | From | David Morávek<d...@apache.org> | > > > >> > > >> | Date | 02/27/2023 22:45 | > > > >> > > >> | To | <dev@flink.apache.org> | > > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the > Implementation > > > of > > > >> > > >> SlotManager | > > > >> > > >> Hi Weihua, I still need to dig into the details, but the > > overall > > > >> > > sentiment > > > >> > > >> of this change sounds reasonable. > > > >> > > >> > > > >> > > >> Best, > > > >> > > >> D. > > > >> > > >> > > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen < > > > >> > > zhanghao.c...@outlook.com> > > > >> > > >> wrote: > > > >> > > >> > > > >> > > >> Thanks for driving this topic. I think this FLIP could help > > > clean up > > > >> > the > > > >> > > >> codebase to make it easier to maintain. +1 on it. > > > >> > > >> > > > >> > > >> Best, > > > >> > > >> Zhanghao Chen > > > >> > > >> ________________________________ > > > >> > > >> From: Weihua Hu <huweihua....@gmail.com> > > > >> > > >> Sent: Monday, February 27, 2023 20:40 > > > >> > > >> To: dev <dev@flink.apache.org> > > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of > > > >> > SlotManager > > > >> > > >> > > > >> > > >> Hi everyone, > > > >> > > >> > > > >> > > >> I would like to begin a discussion on FLIP-298: Unifying the > > > >> > > Implementation > > > >> > > >> of SlotManager[1]. There are currently two types of > SlotManager > > > in > > > >> > > Flink: > > > >> > > >> DeclarativeSlotManager and FineGrainedSlotManager. > > > >> > > FineGrainedSlotManager > > > >> > > >> should behave as DeclarativeSlotManager if the user does not > > > >> configure > > > >> > > the > > > >> > > >> slot request profile. > > > >> > > >> > > > >> > > >> Therefore, this FLIP aims to unify the implementation of > > > SlotManager > > > >> > in > > > >> > > >> order to reduce maintenance costs. > > > >> > > >> > > > >> > > >> Looking forward to hearing from you. > > > >> > > >> > > > >> > > >> [1] > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> > > > > > >> > > > > >> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager > > > >> > > >> > > > >> > > >> Best, > > > >> > > >> Weihua > > > >> > > >> > > > >> > > >> > > > >> > > > > > >> > > > > >> > > > > > >