Hi, @David @Matthias There are a few days after hearing your thoughts. I would like to know if there are any other concerns about this FLIP.
Best, Weihua On Mon, Mar 6, 2023 at 7:53 PM Weihua Hu <huweihua....@gmail.com> wrote: > > Thanks Shammon, > > I've updated FLIP to add this redundant Task Manager limitation. > > > Best, > Weihua > > > On Mon, Mar 6, 2023 at 5:07 PM Shammon FY <zjur...@gmail.com> wrote: > >> Hi weihua >> >> Can you add content related to `heterogeneous resources` to this FLIP? We >> can record it and consider it in the future. It may be useful for some >> scenarios, such as the combination of streaming and ML. >> >> Best, >> Shammon >> >> >> On Mon, Mar 6, 2023 at 1:39 PM weijie guo <guoweijieres...@gmail.com> >> wrote: >> >> > Hi Weihua, >> > >> > Thanks for your clarification, SGTM. >> > >> > Best regards, >> > >> > Weijie >> > >> > >> > Weihua Hu <huweihua....@gmail.com> 于2023年3月6日周一 11:43写道: >> > >> > > Thanks Weijie. >> > > >> > > Heterogeneous task managers will not be considered in this FLIP since >> > > it does not request heterogeneous resources as you said. >> > > >> > > My first thought is we can adjust the meaning of redundant >> configuration >> > > to redundant number of per resource type. These can be considered in >> > > detail when we decide to support heterogeneous task managers. >> > > >> > > Best, >> > > Weihua >> > > >> > > >> > > On Sat, Mar 4, 2023 at 1:13 AM weijie guo <guoweijieres...@gmail.com> >> > > wrote: >> > > >> > > > Thanks Weihua for preparing this FLIP. >> > > > >> > > > This FLIP overall looks reasonable to me after updating as >> suggested by >> > > > Matthias. >> > > > >> > > > I only have one small question about keeping some redundant task >> > > managers: >> > > > In the fine-grained resource management, theoretically, it can >> support >> > > > heterogeneous taskmanagers. When we complete the missing features >> for >> > > FGSM, >> > > > do we plan to take this into account? >> > > > Of course, if I remember correctly, FGSM will not request >> heterogeneous >> > > > resources at present, so it is also acceptable to me if there is no >> > > special >> > > > treatment now. >> > > > >> > > > +1 for this changes if we can ensure the test coverage. >> > > > >> > > > Best regards, >> > > > >> > > > Weijie >> > > > >> > > > >> > > > John Roesler <vvcep...@apache.org> 于2023年3月2日周四 12:53写道: >> > > > >> > > > > Thanks for the test plan, Weihua! >> > > > > >> > > > > Yes, it addresses my concerns. >> > > > > >> > > > > Thanks, >> > > > > John >> > > > > >> > > > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote: >> > > > > > Hi, everyone, >> > > > > > Thanks for your suggestions and ideas. >> > > > > > Thanks Xintong for sharing the detailed backgrounds of >> SlotManager. >> > > > > > >> > > > > > *@Matthias >> > > > > > >> > > > > > 1. Did you do a proper test coverage analysis? >> > > > > > >> > > > > > >> > > > > > Just as Xintong said, we already have a CI stage for fine >> grained >> > > > > resource >> > > > > > managers. >> > > > > > And I will make sure FineGrainedSlotManager as the default >> > > SlotManager >> > > > > can >> > > > > > pass all the tests of CI. >> > > > > > In addition, I will review all unit tests of >> > > > DeclarativeSlotManager(DSM) >> > > > > to >> > > > > > ensure that there are no gaps in the >> > > > > > coverage provided by the FineGrainedSlotManager. >> > > > > > I also added the 'Test Plan' part to the FLIP. >> > > > > > @Matthias @John @Shammon Does this test plan address your >> concerns? >> > > > > > >> > > > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite >> > big >> > > in >> > > > > > >> > > > > > terms of lines of code.... >> > > > > > >> > > > > > >> > > > > > IMO, the refactoring of SlotManager does not belong to this FLIP >> > > since >> > > > it >> > > > > > may lead to some unstable risks. For >> > > > > > FineGrainedSlotManager(FGSM), we already split some reasonable >> > > > > components. >> > > > > > They are: >> > > > > > * TaskManagerTracker: Track task managers and their resources. >> > > > > > * ResourceTracker: track requirements of jobs >> > > > > > * ResourceAllocationStrategy: Try to fulfill the resource >> > > requirements >> > > > > with >> > > > > > available/pending resources. >> > > > > > * SlotStatusSyncer: communicate with TaskManager, for >> > > > allocating/freeing >> > > > > > slot and reconciling the slot status >> > > > > > Maybe we can start a discussion about refactoring SlotManager in >> > > > another >> > > > > > FLIP if there are some good suggestions. >> > > > > > WDYT >> > > > > > >> > > > > > 3. For me personally, having a more detailed summary comparing >> the >> > > > > >> subcomponents of both SlotManager implementations with where >> > > > > >> their functionality matches and where they differ might help >> > > > understand >> > > > > the >> > > > > >> consequences of the changes proposed in FLIP-298 >> > > > > > >> > > > > > Good suggestion, I have updated the comparison in this FLIP. >> > Looking >> > > > > > forward to any suggestions/thoughts >> > > > > > if they are not described clearly. >> > > > > > >> > > > > > *@John >> > > > > > >> > > > > > 4. In addition to changing the default, would it make sense to >> log >> > a >> > > > > >> deprecation warning on initialization >> > > > > > >> > > > > > if the DeclarativeSlotManager is used? >> > > > > >> >> > > > > > SGTM, We should add Deprecated annotations to DSM for devs. And >> > log a >> > > > > > deprecation warning for users. >> > > > > > >> > > > > > *@Shammon >> > > > > > >> > > > > > 1. For their functional differences, can you give some detailed >> > tests >> > > > to >> > > > > >> verify that the new FineGrainedSlotManager has these >> capabilities? >> > > > This >> > > > > can >> > > > > >> effectively verify the new functions >> > > > > >> >> > > > > > As just maintained, there is already a CI stage of FGSM, and I >> will >> > > do >> > > > > more >> > > > > > review of unit tests for DSM. >> > > > > > >> > > > > > 2. I'm worried that many functions are not independent and it >> is >> > > > > difficult >> > > > > >> to migrate step-by-step. You can list the relationship between >> > them >> > > in >> > > > > >> detail. >> > > > > > >> > > > > > As Xintong saied the DSM is a subset of FGSM by design. But as >> > time >> > > > goes >> > > > > > on, FGSM has some lacking >> > > > > > functions as I listed in this FLIP. And I have added the >> comparison >> > > > > between >> > > > > > DSM and FGSM in this FLIP. >> > > > > > >> > > > > > >> > > > > > Thanks again for all your thoughts. Any feedback is appreciated! >> > > > > > >> > > > > > Best, >> > > > > > Weihua >> > > > > > >> > > > > > >> > > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song < >> tonysong...@gmail.com >> > > >> > > > > wrote: >> > > > > > >> > > > > >> Thanks Weihua for preparing this FLIP. +1 for the proposal. >> > > > > >> >> > > > > >> >> > > > > >> As one of the contributors of the fine-grained slot manager, >> I'd >> > > like >> > > > to >> > > > > >> share some backgrounds here. >> > > > > >> >> > > > > >> - There used to be a defaut slot manager implementation, which >> is >> > > > > >> non-declarative and has been removed now. The two features, >> > > > declarative >> > > > > / >> > > > > >> reactive resource management and fine-grained resource >> management, >> > > > were >> > > > > >> proposed at about the same time. We were aware that by design >> the >> > > > > >> declarative slot manager is a subset of fine-grained slot >> manager >> > at >> > > > > that >> > > > > >> time, but still decided to implement two slot managers for the >> > > purpose >> > > > > of >> > > > > >> decoupling efforts and reducing cross-team synchronization >> > overhead. >> > > > > >> Merging the two slot managers once they are proved stable is >> IMO a >> > > > > >> technical debt that was planned at the very beginning. >> > > > > >> >> > > > > >> - The FineGrainedSlotManager has been verified in Alibaba's >> > internal >> > > > > >> production as well as Alibaba Cloud services as the default >> slot >> > > > manager >> > > > > >> for about 2 years. >> > > > > >> >> > > > > >> >> > > > > >> Concerning test cases, we currently have a ci stage for fine >> > grained >> > > > > >> resource management. To avoid adding too much burden, the stage >> > only >> > > > > >> includes tests from flink-runtime and flink-test modules. I >> think >> > > > > switching >> > > > > >> the default slot manager and applying the whole set of tests on >> > the >> > > > > >> fine-grained slot manager would help us to be more confident >> about >> > > it. >> > > > > >> >> > > > > >> >> > > > > >> Concerning cutting out functionalities out of slot manager, I >> > think >> > > > > Yangze >> > > > > >> and I have tried our best to shape the FineGrainedSlotManager >> into >> > > > > >> reasonable components. I personally don't have other ideas to >> > > further >> > > > > >> disassemble the component, but I'm open to such suggestions. >> > > However, >> > > > > from >> > > > > >> the stability perspective, I'd be in favor of not introducing >> > > > > significant >> > > > > >> changes to the FineGrainedSlotManager while switching it to the >> > > > default. >> > > > > >> Because the current implementation has already been verified >> (or >> > at >> > > > > least >> > > > > >> partially verified because Alibaba does not cover all the Flink >> > use >> > > > > cases), >> > > > > >> and introducing more changes also means more chances of >> breaking >> > > > things. >> > > > > >> >> > > > > >> >> > > > > >> Best, >> > > > > >> >> > > > > >> Xintong >> > > > > >> >> > > > > >> >> > > > > >> >> > > > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zjur...@gmail.com> >> > > wrote: >> > > > > >> >> > > > > >> > Hi >> > > > > >> > >> > > > > >> > Thanks for starting this work weihua, I think unifying >> > > > > >> > DeclarativeSlotManager and FineGrainedSlotManager is >> valuable. >> > > > > >> > >> > > > > >> > I agree with @Matthias and @John that we need a way to ensure >> > that >> > > > > >> > DeclarativeSlotManager's capabilities are fully covered by >> > > > > >> > FineGrainedSlotManager >> > > > > >> > >> > > > > >> > 1. For their functional differences, can you give some >> detailed >> > > > tests >> > > > > to >> > > > > >> > verify that the new FineGrainedSlotManager has these >> > capabilities? >> > > > > This >> > > > > >> can >> > > > > >> > effectively verify the new functions >> > > > > >> > >> > > > > >> > 2. I'm worried that many functions are not independent and >> it is >> > > > > >> difficult >> > > > > >> > to migrate step-by-step. You can list the relationship >> between >> > > them >> > > > in >> > > > > >> > detail. >> > > > > >> > >> > > > > >> > 3. As John mentioned, give a smoke test for >> > FineGrainedSlotManager >> > > > is >> > > > > a >> > > > > >> > good idea. Or you can add some test information to the >> > > > > >> > DeclarativeSlotManager to determine how many tests have used >> it. >> > > In >> > > > > this >> > > > > >> > way, we can gradually construct test cases for >> > > > FineGrainedSlotManager >> > > > > >> > during the development process. >> > > > > >> > >> > > > > >> > >> > > > > >> > Best, >> > > > > >> > Shammon >> > > > > >> > >> > > > > >> > >> > > > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler < >> > > vvcep...@apache.org> >> > > > > >> wrote: >> > > > > >> > >> > > > > >> > > Thanks for the FLIP, Weihua! >> > > > > >> > > >> > > > > >> > > I’ve read the FLIP, and it sounds good to me. We need to >> avoid >> > > > > >> > > proliferating alternative implementations wherever >> possible. I >> > > > have >> > > > > >> just >> > > > > >> > a >> > > > > >> > > couple of comments: >> > > > > >> > > >> > > > > >> > > 1. I share Matthias’s concern about ensuring the behavior >> is >> > > > really >> > > > > the >> > > > > >> > > same. One suggestion I’ve used for this kind of thing is, >> as a >> > > > smoke >> > > > > >> > test, >> > > > > >> > > to update the DeclarativeSlotManager to just delegate to >> the >> > > > > >> > > FineGrainedSlotManager. If the full test suite still >> passes, >> > you >> > > > > can be >> > > > > >> > > pretty sure the new default is really ok. It would not be a >> > good >> > > > > idea >> > > > > >> to >> > > > > >> > > actually keep that in for the release, since it would >> remove >> > the >> > > > > option >> > > > > >> > to >> > > > > >> > > fall back in case of bugs. Either way, we need to make sure >> > all >> > > > test >> > > > > >> > > scenarios are present for the FGSM. >> > > > > >> > > >> > > > > >> > > 4. In addition to changing the default, would it make >> sense to >> > > > log a >> > > > > >> > > deprecation warning on initialization if the >> > > > DeclarativeSlotManager >> > > > > is >> > > > > >> > used? >> > > > > >> > > >> > > > > >> > > Thanks again, >> > > > > >> > > John >> > > > > >> > > >> > > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote: >> > > > > >> > > > Hi Weihua, >> > > > > >> > > > Thanks for your proposal. From a conceptual point: AFAIU, >> > the >> > > > > >> > > > DeclarativeSlotManager covers a subset (i.e. only evenly >> > sized >> > > > > slots) >> > > > > >> > of >> > > > > >> > > > what the FineGrainedSlotManager should be able to achieve >> > > > > (variable >> > > > > >> > slot >> > > > > >> > > > size per task manager). Is this the right >> > > > > assumption/understanding? >> > > > > >> In >> > > > > >> > > this >> > > > > >> > > > sense, merging both implementations into a single one >> sounds >> > > > > good. A >> > > > > >> > few >> > > > > >> > > > more general comments, though: >> > > > > >> > > > >> > > > > >> > > > 1. Did you do a proper test coverage analysis? That's not >> > > > > mentioned >> > > > > >> in >> > > > > >> > > the >> > > > > >> > > > current version of the FLIP. I'm bringing this up >> because we >> > > ran >> > > > > into >> > > > > >> > the >> > > > > >> > > > same issue when fixing the flaws that popped up after >> > > > introducing >> > > > > the >> > > > > >> > > > multi-component leader election (see FLIP-285 [1]). There >> > is a >> > > > > risk >> > > > > >> > that >> > > > > >> > > by >> > > > > >> > > > removing the legacy code we decrease test coverage >> because >> > > > certain >> > > > > >> > > > test cases that were covered for the legacy classes might >> > not >> > > be >> > > > > >> > > > necessarily covered in the new implementation, yet (see >> > > > > FLINK-30338 >> > > > > >> [2] >> > > > > >> > > > which covers this issue for the leader election case). >> > > Ideally, >> > > > we >> > > > > >> > don't >> > > > > >> > > > want to remove test cases accidentally because they were >> > only >> > > > > >> > implemented >> > > > > >> > > > for the DeclarativeSlotManager but missed for the >> > > > > >> > FineGrainedSlotManager. >> > > > > >> > > > >> > > > > >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel >> > > quite >> > > > > big >> > > > > >> in >> > > > > >> > > > terms of lines of code. Without knowing whether it's >> > actually >> > > a >> > > > > >> > > reasonable >> > > > > >> > > > thing to do: Instead of just adding more features to the >> > > > > >> > > > FineGrainedSlotManager, have you thought of cutting out >> > > > > functionality >> > > > > >> > > into >> > > > > >> > > > smaller sub-components along this refactoring? Such a >> > > > step-by-step >> > > > > >> > > approach >> > > > > >> > > > might improve the overall codebase and might make >> reviewing >> > > the >> > > > > >> > > refactoring >> > > > > >> > > > easier. I did a first pass over the code and struggled to >> > > > identify >> > > > > >> code >> > > > > >> > > > blocks that could be moved out of the SlotManager >> > > > > implementation(s). >> > > > > >> > > > Therefore, I might be wrong with this proposal. I haven't >> > > worked >> > > > > on >> > > > > >> > this >> > > > > >> > > > codebase in that detail that it would allow me to come up >> > > with a >> > > > > >> > > judgement >> > > > > >> > > > call. I wanted to bring it up, anyway, because I'm >> curious >> > > > whether >> > > > > >> that >> > > > > >> > > > could be an option. There's a comment created by Chesnay >> > > (CC'd) >> > > > in >> > > > > >> the >> > > > > >> > > > JavaDoc of TaskExecutorManager [3] indicating something >> > > similar. >> > > > > I'm >> > > > > >> > > > wondering whether he can add some insights here. >> > > > > >> > > > >> > > > > >> > > > 3. For me personally, having a more detailed summary >> > comparing >> > > > the >> > > > > >> > > > subcomponents of both SlotManager implementations with >> where >> > > > > >> > > > their functionality matches and where they differ might >> help >> > > > > >> understand >> > > > > >> > > the >> > > > > >> > > > consequences of the changes proposed in FLIP-298. >> > > > > >> > > > >> > > > > >> > > > Best, >> > > > > >> > > > Matthias >> > > > > >> > > > >> > > > > >> > > > [1] >> > > > > >> > > > >> > > > > >> > > >> > > > > >> > >> > > > > >> >> > > > > >> > > > >> > > >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box >> > > > > >> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338 >> > > > > >> > > > [3] >> > > > > >> > > > >> > > > > >> > > >> > > > > >> > >> > > > > >> >> > > > > >> > > > >> > > >> > >> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68 >> > > > > >> > > > >> > > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang < >> wang...@163.com> >> > > > > wrote: >> > > > > >> > > > >> > > > > >> > > >> This is a good proposal for me, it will make the code of >> > the >> > > > > >> > SlotManager >> > > > > >> > > >> more clear. >> > > > > >> > > >> >> > > > > >> > > >> >> > > > > >> > > >> >> > > > > >> > > >> -- >> > > > > >> > > >> >> > > > > >> > > >> Best, >> > > > > >> > > >> Matt Wang >> > > > > >> > > >> >> > > > > >> > > >> >> > > > > >> > > >> ---- Replied Message ---- >> > > > > >> > > >> | From | David Morávek<d...@apache.org> | >> > > > > >> > > >> | Date | 02/27/2023 22:45 | >> > > > > >> > > >> | To | <dev@flink.apache.org> | >> > > > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the >> > > Implementation >> > > > > of >> > > > > >> > > >> SlotManager | >> > > > > >> > > >> Hi Weihua, I still need to dig into the details, but the >> > > > overall >> > > > > >> > > sentiment >> > > > > >> > > >> of this change sounds reasonable. >> > > > > >> > > >> >> > > > > >> > > >> Best, >> > > > > >> > > >> D. >> > > > > >> > > >> >> > > > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen < >> > > > > >> > > zhanghao.c...@outlook.com> >> > > > > >> > > >> wrote: >> > > > > >> > > >> >> > > > > >> > > >> Thanks for driving this topic. I think this FLIP could >> help >> > > > > clean up >> > > > > >> > the >> > > > > >> > > >> codebase to make it easier to maintain. +1 on it. >> > > > > >> > > >> >> > > > > >> > > >> Best, >> > > > > >> > > >> Zhanghao Chen >> > > > > >> > > >> ________________________________ >> > > > > >> > > >> From: Weihua Hu <huweihua....@gmail.com> >> > > > > >> > > >> Sent: Monday, February 27, 2023 20:40 >> > > > > >> > > >> To: dev <dev@flink.apache.org> >> > > > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the >> Implementation of >> > > > > >> > SlotManager >> > > > > >> > > >> >> > > > > >> > > >> Hi everyone, >> > > > > >> > > >> >> > > > > >> > > >> I would like to begin a discussion on FLIP-298: Unifying >> > the >> > > > > >> > > Implementation >> > > > > >> > > >> of SlotManager[1]. There are currently two types of >> > > SlotManager >> > > > > in >> > > > > >> > > Flink: >> > > > > >> > > >> DeclarativeSlotManager and FineGrainedSlotManager. >> > > > > >> > > FineGrainedSlotManager >> > > > > >> > > >> should behave as DeclarativeSlotManager if the user does >> > not >> > > > > >> configure >> > > > > >> > > the >> > > > > >> > > >> slot request profile. >> > > > > >> > > >> >> > > > > >> > > >> Therefore, this FLIP aims to unify the implementation of >> > > > > SlotManager >> > > > > >> > in >> > > > > >> > > >> order to reduce maintenance costs. >> > > > > >> > > >> >> > > > > >> > > >> Looking forward to hearing from you. >> > > > > >> > > >> >> > > > > >> > > >> [1] >> > > > > >> > > >> >> > > > > >> > > >> >> > > > > >> > > >> >> > > > > >> > > >> > > > > >> > >> > > > > >> >> > > > > >> > > > >> > > >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager >> > > > > >> > > >> >> > > > > >> > > >> Best, >> > > > > >> > > >> Weihua >> > > > > >> > > >> >> > > > > >> > > >> >> > > > > >> > > >> > > > > >> > >> > > > > >> >> > > > > >> > > > >> > > >> > >> >