Thanks Weihua for preparing this FLIP. This FLIP overall looks reasonable to me after updating as suggested by Matthias.
I only have one small question about keeping some redundant task managers: In the fine-grained resource management, theoretically, it can support heterogeneous taskmanagers. When we complete the missing features for FGSM, do we plan to take this into account? Of course, if I remember correctly, FGSM will not request heterogeneous resources at present, so it is also acceptable to me if there is no special treatment now. +1 for this changes if we can ensure the test coverage. Best regards, Weijie John Roesler <vvcep...@apache.org> 于2023年3月2日周四 12:53写道: > Thanks for the test plan, Weihua! > > Yes, it addresses my concerns. > > Thanks, > John > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote: > > Hi, everyone, > > Thanks for your suggestions and ideas. > > Thanks Xintong for sharing the detailed backgrounds of SlotManager. > > > > *@Matthias > > > > 1. Did you do a proper test coverage analysis? > > > > > > Just as Xintong said, we already have a CI stage for fine grained > resource > > managers. > > And I will make sure FineGrainedSlotManager as the default SlotManager > can > > pass all the tests of CI. > > In addition, I will review all unit tests of DeclarativeSlotManager(DSM) > to > > ensure that there are no gaps in the > > coverage provided by the FineGrainedSlotManager. > > I also added the 'Test Plan' part to the FLIP. > > @Matthias @John @Shammon Does this test plan address your concerns? > > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite big in > > > > terms of lines of code.... > > > > > > IMO, the refactoring of SlotManager does not belong to this FLIP since it > > may lead to some unstable risks. For > > FineGrainedSlotManager(FGSM), we already split some reasonable > components. > > They are: > > * TaskManagerTracker: Track task managers and their resources. > > * ResourceTracker: track requirements of jobs > > * ResourceAllocationStrategy: Try to fulfill the resource requirements > with > > available/pending resources. > > * SlotStatusSyncer: communicate with TaskManager, for allocating/freeing > > slot and reconciling the slot status > > Maybe we can start a discussion about refactoring SlotManager in another > > FLIP if there are some good suggestions. > > WDYT > > > > 3. For me personally, having a more detailed summary comparing the > >> subcomponents of both SlotManager implementations with where > >> their functionality matches and where they differ might help understand > the > >> consequences of the changes proposed in FLIP-298 > > > > Good suggestion, I have updated the comparison in this FLIP. Looking > > forward to any suggestions/thoughts > > if they are not described clearly. > > > > *@John > > > > 4. In addition to changing the default, would it make sense to log a > >> deprecation warning on initialization > > > > if the DeclarativeSlotManager is used? > >> > > SGTM, We should add Deprecated annotations to DSM for devs. And log a > > deprecation warning for users. > > > > *@Shammon > > > > 1. For their functional differences, can you give some detailed tests to > >> verify that the new FineGrainedSlotManager has these capabilities? This > can > >> effectively verify the new functions > >> > > As just maintained, there is already a CI stage of FGSM, and I will do > more > > review of unit tests for DSM. > > > > 2. I'm worried that many functions are not independent and it is > difficult > >> to migrate step-by-step. You can list the relationship between them in > >> detail. > > > > As Xintong saied the DSM is a subset of FGSM by design. But as time goes > > on, FGSM has some lacking > > functions as I listed in this FLIP. And I have added the comparison > between > > DSM and FGSM in this FLIP. > > > > > > Thanks again for all your thoughts. Any feedback is appreciated! > > > > Best, > > Weihua > > > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <tonysong...@gmail.com> > wrote: > > > >> Thanks Weihua for preparing this FLIP. +1 for the proposal. > >> > >> > >> As one of the contributors of the fine-grained slot manager, I'd like to > >> share some backgrounds here. > >> > >> - There used to be a defaut slot manager implementation, which is > >> non-declarative and has been removed now. The two features, declarative > / > >> reactive resource management and fine-grained resource management, were > >> proposed at about the same time. We were aware that by design the > >> declarative slot manager is a subset of fine-grained slot manager at > that > >> time, but still decided to implement two slot managers for the purpose > of > >> decoupling efforts and reducing cross-team synchronization overhead. > >> Merging the two slot managers once they are proved stable is IMO a > >> technical debt that was planned at the very beginning. > >> > >> - The FineGrainedSlotManager has been verified in Alibaba's internal > >> production as well as Alibaba Cloud services as the default slot manager > >> for about 2 years. > >> > >> > >> Concerning test cases, we currently have a ci stage for fine grained > >> resource management. To avoid adding too much burden, the stage only > >> includes tests from flink-runtime and flink-test modules. I think > switching > >> the default slot manager and applying the whole set of tests on the > >> fine-grained slot manager would help us to be more confident about it. > >> > >> > >> Concerning cutting out functionalities out of slot manager, I think > Yangze > >> and I have tried our best to shape the FineGrainedSlotManager into > >> reasonable components. I personally don't have other ideas to further > >> disassemble the component, but I'm open to such suggestions. However, > from > >> the stability perspective, I'd be in favor of not introducing > significant > >> changes to the FineGrainedSlotManager while switching it to the default. > >> Because the current implementation has already been verified (or at > least > >> partially verified because Alibaba does not cover all the Flink use > cases), > >> and introducing more changes also means more chances of breaking things. > >> > >> > >> Best, > >> > >> Xintong > >> > >> > >> > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zjur...@gmail.com> wrote: > >> > >> > Hi > >> > > >> > Thanks for starting this work weihua, I think unifying > >> > DeclarativeSlotManager and FineGrainedSlotManager is valuable. > >> > > >> > I agree with @Matthias and @John that we need a way to ensure that > >> > DeclarativeSlotManager's capabilities are fully covered by > >> > FineGrainedSlotManager > >> > > >> > 1. For their functional differences, can you give some detailed tests > to > >> > verify that the new FineGrainedSlotManager has these capabilities? > This > >> can > >> > effectively verify the new functions > >> > > >> > 2. I'm worried that many functions are not independent and it is > >> difficult > >> > to migrate step-by-step. You can list the relationship between them in > >> > detail. > >> > > >> > 3. As John mentioned, give a smoke test for FineGrainedSlotManager is > a > >> > good idea. Or you can add some test information to the > >> > DeclarativeSlotManager to determine how many tests have used it. In > this > >> > way, we can gradually construct test cases for FineGrainedSlotManager > >> > during the development process. > >> > > >> > > >> > Best, > >> > Shammon > >> > > >> > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <vvcep...@apache.org> > >> wrote: > >> > > >> > > Thanks for the FLIP, Weihua! > >> > > > >> > > I’ve read the FLIP, and it sounds good to me. We need to avoid > >> > > proliferating alternative implementations wherever possible. I have > >> just > >> > a > >> > > couple of comments: > >> > > > >> > > 1. I share Matthias’s concern about ensuring the behavior is really > the > >> > > same. One suggestion I’ve used for this kind of thing is, as a smoke > >> > test, > >> > > to update the DeclarativeSlotManager to just delegate to the > >> > > FineGrainedSlotManager. If the full test suite still passes, you > can be > >> > > pretty sure the new default is really ok. It would not be a good > idea > >> to > >> > > actually keep that in for the release, since it would remove the > option > >> > to > >> > > fall back in case of bugs. Either way, we need to make sure all test > >> > > scenarios are present for the FGSM. > >> > > > >> > > 4. In addition to changing the default, would it make sense to log a > >> > > deprecation warning on initialization if the DeclarativeSlotManager > is > >> > used? > >> > > > >> > > Thanks again, > >> > > John > >> > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote: > >> > > > Hi Weihua, > >> > > > Thanks for your proposal. From a conceptual point: AFAIU, the > >> > > > DeclarativeSlotManager covers a subset (i.e. only evenly sized > slots) > >> > of > >> > > > what the FineGrainedSlotManager should be able to achieve > (variable > >> > slot > >> > > > size per task manager). Is this the right > assumption/understanding? > >> In > >> > > this > >> > > > sense, merging both implementations into a single one sounds > good. A > >> > few > >> > > > more general comments, though: > >> > > > > >> > > > 1. Did you do a proper test coverage analysis? That's not > mentioned > >> in > >> > > the > >> > > > current version of the FLIP. I'm bringing this up because we ran > into > >> > the > >> > > > same issue when fixing the flaws that popped up after introducing > the > >> > > > multi-component leader election (see FLIP-285 [1]). There is a > risk > >> > that > >> > > by > >> > > > removing the legacy code we decrease test coverage because certain > >> > > > test cases that were covered for the legacy classes might not be > >> > > > necessarily covered in the new implementation, yet (see > FLINK-30338 > >> [2] > >> > > > which covers this issue for the leader election case). Ideally, we > >> > don't > >> > > > want to remove test cases accidentally because they were only > >> > implemented > >> > > > for the DeclarativeSlotManager but missed for the > >> > FineGrainedSlotManager. > >> > > > > >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite > big > >> in > >> > > > terms of lines of code. Without knowing whether it's actually a > >> > > reasonable > >> > > > thing to do: Instead of just adding more features to the > >> > > > FineGrainedSlotManager, have you thought of cutting out > functionality > >> > > into > >> > > > smaller sub-components along this refactoring? Such a step-by-step > >> > > approach > >> > > > might improve the overall codebase and might make reviewing the > >> > > refactoring > >> > > > easier. I did a first pass over the code and struggled to identify > >> code > >> > > > blocks that could be moved out of the SlotManager > implementation(s). > >> > > > Therefore, I might be wrong with this proposal. I haven't worked > on > >> > this > >> > > > codebase in that detail that it would allow me to come up with a > >> > > judgement > >> > > > call. I wanted to bring it up, anyway, because I'm curious whether > >> that > >> > > > could be an option. There's a comment created by Chesnay (CC'd) in > >> the > >> > > > JavaDoc of TaskExecutorManager [3] indicating something similar. > I'm > >> > > > wondering whether he can add some insights here. > >> > > > > >> > > > 3. For me personally, having a more detailed summary comparing the > >> > > > subcomponents of both SlotManager implementations with where > >> > > > their functionality matches and where they differ might help > >> understand > >> > > the > >> > > > consequences of the changes proposed in FLIP-298. > >> > > > > >> > > > Best, > >> > > > Matthias > >> > > > > >> > > > [1] > >> > > > > >> > > > >> > > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box > >> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338 > >> > > > [3] > >> > > > > >> > > > >> > > >> > https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68 > >> > > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wang...@163.com> > wrote: > >> > > > > >> > > >> This is a good proposal for me, it will make the code of the > >> > SlotManager > >> > > >> more clear. > >> > > >> > >> > > >> > >> > > >> > >> > > >> -- > >> > > >> > >> > > >> Best, > >> > > >> Matt Wang > >> > > >> > >> > > >> > >> > > >> ---- Replied Message ---- > >> > > >> | From | David Morávek<d...@apache.org> | > >> > > >> | Date | 02/27/2023 22:45 | > >> > > >> | To | <dev@flink.apache.org> | > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation > of > >> > > >> SlotManager | > >> > > >> Hi Weihua, I still need to dig into the details, but the overall > >> > > sentiment > >> > > >> of this change sounds reasonable. > >> > > >> > >> > > >> Best, > >> > > >> D. > >> > > >> > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen < > >> > > zhanghao.c...@outlook.com> > >> > > >> wrote: > >> > > >> > >> > > >> Thanks for driving this topic. I think this FLIP could help > clean up > >> > the > >> > > >> codebase to make it easier to maintain. +1 on it. > >> > > >> > >> > > >> Best, > >> > > >> Zhanghao Chen > >> > > >> ________________________________ > >> > > >> From: Weihua Hu <huweihua....@gmail.com> > >> > > >> Sent: Monday, February 27, 2023 20:40 > >> > > >> To: dev <dev@flink.apache.org> > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of > >> > SlotManager > >> > > >> > >> > > >> Hi everyone, > >> > > >> > >> > > >> I would like to begin a discussion on FLIP-298: Unifying the > >> > > Implementation > >> > > >> of SlotManager[1]. There are currently two types of SlotManager > in > >> > > Flink: > >> > > >> DeclarativeSlotManager and FineGrainedSlotManager. > >> > > FineGrainedSlotManager > >> > > >> should behave as DeclarativeSlotManager if the user does not > >> configure > >> > > the > >> > > >> slot request profile. > >> > > >> > >> > > >> Therefore, this FLIP aims to unify the implementation of > SlotManager > >> > in > >> > > >> order to reduce maintenance costs. > >> > > >> > >> > > >> Looking forward to hearing from you. > >> > > >> > >> > > >> [1] > >> > > >> > >> > > >> > >> > > >> > >> > > > >> > > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager > >> > > >> > >> > > >> Best, > >> > > >> Weihua > >> > > >> > >> > > >> > >> > > > >> > > >> >