Thanks for the test plan, Weihua! Yes, it addresses my concerns.
Thanks, John On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote: > Hi, everyone, > Thanks for your suggestions and ideas. > Thanks Xintong for sharing the detailed backgrounds of SlotManager. > > *@Matthias > > 1. Did you do a proper test coverage analysis? > > > Just as Xintong said, we already have a CI stage for fine grained resource > managers. > And I will make sure FineGrainedSlotManager as the default SlotManager can > pass all the tests of CI. > In addition, I will review all unit tests of DeclarativeSlotManager(DSM) to > ensure that there are no gaps in the > coverage provided by the FineGrainedSlotManager. > I also added the 'Test Plan' part to the FLIP. > @Matthias @John @Shammon Does this test plan address your concerns? > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite big in > > terms of lines of code.... > > > IMO, the refactoring of SlotManager does not belong to this FLIP since it > may lead to some unstable risks. For > FineGrainedSlotManager(FGSM), we already split some reasonable components. > They are: > * TaskManagerTracker: Track task managers and their resources. > * ResourceTracker: track requirements of jobs > * ResourceAllocationStrategy: Try to fulfill the resource requirements with > available/pending resources. > * SlotStatusSyncer: communicate with TaskManager, for allocating/freeing > slot and reconciling the slot status > Maybe we can start a discussion about refactoring SlotManager in another > FLIP if there are some good suggestions. > WDYT > > 3. For me personally, having a more detailed summary comparing the >> subcomponents of both SlotManager implementations with where >> their functionality matches and where they differ might help understand the >> consequences of the changes proposed in FLIP-298 > > Good suggestion, I have updated the comparison in this FLIP. Looking > forward to any suggestions/thoughts > if they are not described clearly. > > *@John > > 4. In addition to changing the default, would it make sense to log a >> deprecation warning on initialization > > if the DeclarativeSlotManager is used? >> > SGTM, We should add Deprecated annotations to DSM for devs. And log a > deprecation warning for users. > > *@Shammon > > 1. For their functional differences, can you give some detailed tests to >> verify that the new FineGrainedSlotManager has these capabilities? This can >> effectively verify the new functions >> > As just maintained, there is already a CI stage of FGSM, and I will do more > review of unit tests for DSM. > > 2. I'm worried that many functions are not independent and it is difficult >> to migrate step-by-step. You can list the relationship between them in >> detail. > > As Xintong saied the DSM is a subset of FGSM by design. But as time goes > on, FGSM has some lacking > functions as I listed in this FLIP. And I have added the comparison between > DSM and FGSM in this FLIP. > > > Thanks again for all your thoughts. Any feedback is appreciated! > > Best, > Weihua > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <tonysong...@gmail.com> wrote: > >> Thanks Weihua for preparing this FLIP. +1 for the proposal. >> >> >> As one of the contributors of the fine-grained slot manager, I'd like to >> share some backgrounds here. >> >> - There used to be a defaut slot manager implementation, which is >> non-declarative and has been removed now. The two features, declarative / >> reactive resource management and fine-grained resource management, were >> proposed at about the same time. We were aware that by design the >> declarative slot manager is a subset of fine-grained slot manager at that >> time, but still decided to implement two slot managers for the purpose of >> decoupling efforts and reducing cross-team synchronization overhead. >> Merging the two slot managers once they are proved stable is IMO a >> technical debt that was planned at the very beginning. >> >> - The FineGrainedSlotManager has been verified in Alibaba's internal >> production as well as Alibaba Cloud services as the default slot manager >> for about 2 years. >> >> >> Concerning test cases, we currently have a ci stage for fine grained >> resource management. To avoid adding too much burden, the stage only >> includes tests from flink-runtime and flink-test modules. I think switching >> the default slot manager and applying the whole set of tests on the >> fine-grained slot manager would help us to be more confident about it. >> >> >> Concerning cutting out functionalities out of slot manager, I think Yangze >> and I have tried our best to shape the FineGrainedSlotManager into >> reasonable components. I personally don't have other ideas to further >> disassemble the component, but I'm open to such suggestions. However, from >> the stability perspective, I'd be in favor of not introducing significant >> changes to the FineGrainedSlotManager while switching it to the default. >> Because the current implementation has already been verified (or at least >> partially verified because Alibaba does not cover all the Flink use cases), >> and introducing more changes also means more chances of breaking things. >> >> >> Best, >> >> Xintong >> >> >> >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zjur...@gmail.com> wrote: >> >> > Hi >> > >> > Thanks for starting this work weihua, I think unifying >> > DeclarativeSlotManager and FineGrainedSlotManager is valuable. >> > >> > I agree with @Matthias and @John that we need a way to ensure that >> > DeclarativeSlotManager's capabilities are fully covered by >> > FineGrainedSlotManager >> > >> > 1. For their functional differences, can you give some detailed tests to >> > verify that the new FineGrainedSlotManager has these capabilities? This >> can >> > effectively verify the new functions >> > >> > 2. I'm worried that many functions are not independent and it is >> difficult >> > to migrate step-by-step. You can list the relationship between them in >> > detail. >> > >> > 3. As John mentioned, give a smoke test for FineGrainedSlotManager is a >> > good idea. Or you can add some test information to the >> > DeclarativeSlotManager to determine how many tests have used it. In this >> > way, we can gradually construct test cases for FineGrainedSlotManager >> > during the development process. >> > >> > >> > Best, >> > Shammon >> > >> > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <vvcep...@apache.org> >> wrote: >> > >> > > Thanks for the FLIP, Weihua! >> > > >> > > I’ve read the FLIP, and it sounds good to me. We need to avoid >> > > proliferating alternative implementations wherever possible. I have >> just >> > a >> > > couple of comments: >> > > >> > > 1. I share Matthias’s concern about ensuring the behavior is really the >> > > same. One suggestion I’ve used for this kind of thing is, as a smoke >> > test, >> > > to update the DeclarativeSlotManager to just delegate to the >> > > FineGrainedSlotManager. If the full test suite still passes, you can be >> > > pretty sure the new default is really ok. It would not be a good idea >> to >> > > actually keep that in for the release, since it would remove the option >> > to >> > > fall back in case of bugs. Either way, we need to make sure all test >> > > scenarios are present for the FGSM. >> > > >> > > 4. In addition to changing the default, would it make sense to log a >> > > deprecation warning on initialization if the DeclarativeSlotManager is >> > used? >> > > >> > > Thanks again, >> > > John >> > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote: >> > > > Hi Weihua, >> > > > Thanks for your proposal. From a conceptual point: AFAIU, the >> > > > DeclarativeSlotManager covers a subset (i.e. only evenly sized slots) >> > of >> > > > what the FineGrainedSlotManager should be able to achieve (variable >> > slot >> > > > size per task manager). Is this the right assumption/understanding? >> In >> > > this >> > > > sense, merging both implementations into a single one sounds good. A >> > few >> > > > more general comments, though: >> > > > >> > > > 1. Did you do a proper test coverage analysis? That's not mentioned >> in >> > > the >> > > > current version of the FLIP. I'm bringing this up because we ran into >> > the >> > > > same issue when fixing the flaws that popped up after introducing the >> > > > multi-component leader election (see FLIP-285 [1]). There is a risk >> > that >> > > by >> > > > removing the legacy code we decrease test coverage because certain >> > > > test cases that were covered for the legacy classes might not be >> > > > necessarily covered in the new implementation, yet (see FLINK-30338 >> [2] >> > > > which covers this issue for the leader election case). Ideally, we >> > don't >> > > > want to remove test cases accidentally because they were only >> > implemented >> > > > for the DeclarativeSlotManager but missed for the >> > FineGrainedSlotManager. >> > > > >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite big >> in >> > > > terms of lines of code. Without knowing whether it's actually a >> > > reasonable >> > > > thing to do: Instead of just adding more features to the >> > > > FineGrainedSlotManager, have you thought of cutting out functionality >> > > into >> > > > smaller sub-components along this refactoring? Such a step-by-step >> > > approach >> > > > might improve the overall codebase and might make reviewing the >> > > refactoring >> > > > easier. I did a first pass over the code and struggled to identify >> code >> > > > blocks that could be moved out of the SlotManager implementation(s). >> > > > Therefore, I might be wrong with this proposal. I haven't worked on >> > this >> > > > codebase in that detail that it would allow me to come up with a >> > > judgement >> > > > call. I wanted to bring it up, anyway, because I'm curious whether >> that >> > > > could be an option. There's a comment created by Chesnay (CC'd) in >> the >> > > > JavaDoc of TaskExecutorManager [3] indicating something similar. I'm >> > > > wondering whether he can add some insights here. >> > > > >> > > > 3. For me personally, having a more detailed summary comparing the >> > > > subcomponents of both SlotManager implementations with where >> > > > their functionality matches and where they differ might help >> understand >> > > the >> > > > consequences of the changes proposed in FLIP-298. >> > > > >> > > > Best, >> > > > Matthias >> > > > >> > > > [1] >> > > > >> > > >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box >> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338 >> > > > [3] >> > > > >> > > >> > >> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68 >> > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wang...@163.com> wrote: >> > > > >> > > >> This is a good proposal for me, it will make the code of the >> > SlotManager >> > > >> more clear. >> > > >> >> > > >> >> > > >> >> > > >> -- >> > > >> >> > > >> Best, >> > > >> Matt Wang >> > > >> >> > > >> >> > > >> ---- Replied Message ---- >> > > >> | From | David Morávek<d...@apache.org> | >> > > >> | Date | 02/27/2023 22:45 | >> > > >> | To | <dev@flink.apache.org> | >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation of >> > > >> SlotManager | >> > > >> Hi Weihua, I still need to dig into the details, but the overall >> > > sentiment >> > > >> of this change sounds reasonable. >> > > >> >> > > >> Best, >> > > >> D. >> > > >> >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen < >> > > zhanghao.c...@outlook.com> >> > > >> wrote: >> > > >> >> > > >> Thanks for driving this topic. I think this FLIP could help clean up >> > the >> > > >> codebase to make it easier to maintain. +1 on it. >> > > >> >> > > >> Best, >> > > >> Zhanghao Chen >> > > >> ________________________________ >> > > >> From: Weihua Hu <huweihua....@gmail.com> >> > > >> Sent: Monday, February 27, 2023 20:40 >> > > >> To: dev <dev@flink.apache.org> >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of >> > SlotManager >> > > >> >> > > >> Hi everyone, >> > > >> >> > > >> I would like to begin a discussion on FLIP-298: Unifying the >> > > Implementation >> > > >> of SlotManager[1]. There are currently two types of SlotManager in >> > > Flink: >> > > >> DeclarativeSlotManager and FineGrainedSlotManager. >> > > FineGrainedSlotManager >> > > >> should behave as DeclarativeSlotManager if the user does not >> configure >> > > the >> > > >> slot request profile. >> > > >> >> > > >> Therefore, this FLIP aims to unify the implementation of SlotManager >> > in >> > > >> order to reduce maintenance costs. >> > > >> >> > > >> Looking forward to hearing from you. >> > > >> >> > > >> [1] >> > > >> >> > > >> >> > > >> >> > > >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager >> > > >> >> > > >> Best, >> > > >> Weihua >> > > >> >> > > >> >> > > >> > >>