Hi, Max Thanks for feedback.
SlotManager only releases task managers with no task running on it after an idle timeout. IMO, this behavior could support scale down naturally. There are two ways to trigger scale down: 1. Using adaptive scheduler and release task managers by resource provider(k8s/yarn). Then scheduler will scale down the parallelism to match the less resources. 2. After FLIP-291[1], we could set the parallelism of jobs, this may scale down jobs. In this case, We could add some strategy in SlotPool to release slots. Once SlotPool releases all slots of a task manager, the SlotManager will release this task manager after the idle timeout exceeds. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management Best, Weihua On Fri, Mar 10, 2023 at 10:11 PM Maximilian Michels <m...@apache.org> wrote: > +1 on the proposal. > > I'm wondering about the scale down behavior of the unified > implementation. Does the new unified implementation prioritize > releasing entire task managers in favor of evenly spreading out task > managers? Consider a scale down from parallelism 15 to parallelism 10 > where each task manager has 5 slots. We could either spread out 10 > slots among the 3 task managers, or fit the 10 slots in 2 task > managers and surrender the task manager (at least on k8s). From a cost > saving perspective, the latter would be preferable. > > -Max > > On Thu, Mar 9, 2023 at 4:17 PM Matthias Pohl > <matthias.p...@aiven.io.invalid> wrote: > > > > Thanks for your clarification. I have nothing else to add to the > > discussion. +1 from my side to proceed > > > > On Wed, Mar 8, 2023 at 4:16 AM Weihua Hu <huweihua....@gmail.com> wrote: > > > > > Thanks Yangze for your attention, this would be a great help. > > > > > > And thanks Matthias too. > > > > > > FLIP-156 [1] mentions some incompatibility between fine-grained > resource > > > > management and reactive mode. I assume that this is independent of > the > > > > SlotManager and replacing the DSM with the FGSM wouldn't affect > reactive > > > > mode? > > > > > > Yes. This incompatibility is independent of SlotManager. That means the > > > AdpativeScheduler will always ignore the resource requirement set by > > > slotSharingGroup and declare Unknown ResourceProfile to SlotManager. > > > So, using FGSM as default will not affect reactive mode. > > > > > > About the heterogeneous TaskManager: This is a feature that's also not > > > > supported in the DSM right now, is it? We should state that fact in > the > > > > FLIP if we mentioned that we don't want to implement it for the FSGM. > > > > > > Yes, both DSM and FGSM do not support request heterogeneous > > > TaskManager right now. Heterogeneous will make the resource allocation > > > logic more complicated, such as the resource deadlock if request A > > > allocated > > > the bigger slot B and then request B could not allocate the small slot > A. > > > We > > > need to think more before starting to support the heterogeneous task > > > manager. > > > So, we don't want to implement heterogeneity in this FLIP. > > > > > > Best, > > > Weihua > > > > > > > > > On Wed, Mar 8, 2023 at 12:44 AM Matthias Pohl > > > <matthias.p...@aiven.io.invalid> wrote: > > > > > > > Thanks for updating the FLIP and adding more context to it. > Additionally, > > > > thanks to Xintong and Yangze for offering your expertise here as > > > > contributors to the initial FineGrainedSlotManager implementation. > > > > > > > > The remark on cutting out functionality was only based on some > > > superficial > > > > initial code reading. I cannot come up with a better code structure > > > myself. > > > > Therefore, I'm fine with not refactoring the code as part of this > FLIP. > > > > > > > > The strategies that were proposed around making sure that the > refactoring > > > > is properly backed by tests sound reasonable. My initial concern was > > > based > > > > on the fact that we might have unit test scenarios for the DSM that > are > > > not > > > > covered in the unit tests of the FSGM. In that case, swapping the DSM > > > with > > > > the FSGM might not be good enough. Going over the GSM tests to make > sure > > > > that we're not accidentally deleting test scenarios sounds good to > me. > > > > Thanks, Weihua. > > > > > > > > FLIP-156 [1] mentions some incompatibility between fine-grained > resource > > > > management and reactive mode. I assume that this is independent of > the > > > > SlotManager and replacing the DSM with the FGSM wouldn't affect > reactive > > > > mode? > > > > > > > > About the heterogeneous TaskManager: This is a feature that's also > not > > > > supported in the DSM right now, is it? We should state that fact in > the > > > > FLIP if we mentioned that we don't want to implement it for the FSGM. > > > > > > > > Best, > > > > Matthias > > > > > > > > [1] > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements > > > > > > > > On Tue, Mar 7, 2023 at 8:58 AM Yangze Guo <karma...@gmail.com> > wrote: > > > > > > > > > Hi Weihua, > > > > > > > > > > Thanks for driving this. As Xintong mentioned, this was a technical > > > > > debt from FLIP-56. > > > > > > > > > > The latest version of FLIP sounds good, +1 from my side. As a > > > > > contributor to this component, I'm willing to assist with the > review > > > > > process. Feel free to reach me if you need help. > > > > > > > > > > Best, > > > > > Yangze Guo > > > > > > > > > > On Tue, Mar 7, 2023 at 1:47 PM Weihua Hu <huweihua....@gmail.com> > > > wrote: > > > > > > > > > > > > Hi, > > > > > > > > > > > > @David @Matthias > > > > > > There are a few days after hearing your thoughts. I would like to > > > know > > > > if > > > > > > there are any other concerns about this FLIP. > > > > > > > > > > > > > > > > > > Best, > > > > > > Weihua > > > > > > > > > > > > > > > > > > On Mon, Mar 6, 2023 at 7:53 PM Weihua Hu <huweihua....@gmail.com > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > Thanks Shammon, > > > > > > > > > > > > > > I've updated FLIP to add this redundant Task Manager > limitation. > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > Weihua > > > > > > > > > > > > > > > > > > > > > On Mon, Mar 6, 2023 at 5:07 PM Shammon FY <zjur...@gmail.com> > > > wrote: > > > > > > > > > > > > > >> Hi weihua > > > > > > >> > > > > > > >> Can you add content related to `heterogeneous resources` to > this > > > > > FLIP? We > > > > > > >> can record it and consider it in the future. It may be useful > for > > > > some > > > > > > >> scenarios, such as the combination of streaming and ML. > > > > > > >> > > > > > > >> Best, > > > > > > >> Shammon > > > > > > >> > > > > > > >> > > > > > > >> On Mon, Mar 6, 2023 at 1:39 PM weijie guo < > > > > guoweijieres...@gmail.com> > > > > > > >> wrote: > > > > > > >> > > > > > > >> > Hi Weihua, > > > > > > >> > > > > > > > >> > Thanks for your clarification, SGTM. > > > > > > >> > > > > > > > >> > Best regards, > > > > > > >> > > > > > > > >> > Weijie > > > > > > >> > > > > > > > >> > > > > > > > >> > Weihua Hu <huweihua....@gmail.com> 于2023年3月6日周一 11:43写道: > > > > > > >> > > > > > > > >> > > Thanks Weijie. > > > > > > >> > > > > > > > > >> > > Heterogeneous task managers will not be considered in this > > > FLIP > > > > > since > > > > > > >> > > it does not request heterogeneous resources as you said. > > > > > > >> > > > > > > > > >> > > My first thought is we can adjust the meaning of redundant > > > > > > >> configuration > > > > > > >> > > to redundant number of per resource type. These can be > > > > considered > > > > > in > > > > > > >> > > detail when we decide to support heterogeneous task > managers. > > > > > > >> > > > > > > > > >> > > Best, > > > > > > >> > > Weihua > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > On Sat, Mar 4, 2023 at 1:13 AM weijie guo < > > > > > guoweijieres...@gmail.com> > > > > > > >> > > wrote: > > > > > > >> > > > > > > > > >> > > > Thanks Weihua for preparing this FLIP. > > > > > > >> > > > > > > > > > >> > > > This FLIP overall looks reasonable to me after updating > as > > > > > > >> suggested by > > > > > > >> > > > Matthias. > > > > > > >> > > > > > > > > > >> > > > I only have one small question about keeping some > redundant > > > > task > > > > > > >> > > managers: > > > > > > >> > > > In the fine-grained resource management, theoretically, > it > > > can > > > > > > >> support > > > > > > >> > > > heterogeneous taskmanagers. When we complete the missing > > > > > features > > > > > > >> for > > > > > > >> > > FGSM, > > > > > > >> > > > do we plan to take this into account? > > > > > > >> > > > Of course, if I remember correctly, FGSM will not > request > > > > > > >> heterogeneous > > > > > > >> > > > resources at present, so it is also acceptable to me if > > > there > > > > > is no > > > > > > >> > > special > > > > > > >> > > > treatment now. > > > > > > >> > > > > > > > > > >> > > > +1 for this changes if we can ensure the test coverage. > > > > > > >> > > > > > > > > > >> > > > Best regards, > > > > > > >> > > > > > > > > > >> > > > Weijie > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > John Roesler <vvcep...@apache.org> 于2023年3月2日周四 > 12:53写道: > > > > > > >> > > > > > > > > > >> > > > > Thanks for the test plan, Weihua! > > > > > > >> > > > > > > > > > > >> > > > > Yes, it addresses my concerns. > > > > > > >> > > > > > > > > > > >> > > > > Thanks, > > > > > > >> > > > > John > > > > > > >> > > > > > > > > > > >> > > > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote: > > > > > > >> > > > > > Hi, everyone, > > > > > > >> > > > > > Thanks for your suggestions and ideas. > > > > > > >> > > > > > Thanks Xintong for sharing the detailed backgrounds > of > > > > > > >> SlotManager. > > > > > > >> > > > > > > > > > > > >> > > > > > *@Matthias > > > > > > >> > > > > > > > > > > > >> > > > > > 1. Did you do a proper test coverage analysis? > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > Just as Xintong said, we already have a CI stage for > > > fine > > > > > > >> grained > > > > > > >> > > > > resource > > > > > > >> > > > > > managers. > > > > > > >> > > > > > And I will make sure FineGrainedSlotManager as the > > > default > > > > > > >> > > SlotManager > > > > > > >> > > > > can > > > > > > >> > > > > > pass all the tests of CI. > > > > > > >> > > > > > In addition, I will review all unit tests of > > > > > > >> > > > DeclarativeSlotManager(DSM) > > > > > > >> > > > > to > > > > > > >> > > > > > ensure that there are no gaps in the > > > > > > >> > > > > > coverage provided by the FineGrainedSlotManager. > > > > > > >> > > > > > I also added the 'Test Plan' part to the FLIP. > > > > > > >> > > > > > @Matthias @John @Shammon Does this test plan address > > > your > > > > > > >> concerns? > > > > > > >> > > > > > > > > > > > >> > > > > > 2. DeclarativeSlotManager and > FineGrainedSlotManager > > > feel > > > > > quite > > > > > > >> > big > > > > > > >> > > in > > > > > > >> > > > > > > > > > > > >> > > > > > terms of lines of code.... > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > IMO, the refactoring of SlotManager does not belong > to > > > > this > > > > > FLIP > > > > > > >> > > since > > > > > > >> > > > it > > > > > > >> > > > > > may lead to some unstable risks. For > > > > > > >> > > > > > FineGrainedSlotManager(FGSM), we already split some > > > > > reasonable > > > > > > >> > > > > components. > > > > > > >> > > > > > They are: > > > > > > >> > > > > > * TaskManagerTracker: Track task managers and their > > > > > resources. > > > > > > >> > > > > > * ResourceTracker: track requirements of jobs > > > > > > >> > > > > > * ResourceAllocationStrategy: Try to fulfill the > > > resource > > > > > > >> > > requirements > > > > > > >> > > > > with > > > > > > >> > > > > > available/pending resources. > > > > > > >> > > > > > * SlotStatusSyncer: communicate with TaskManager, > for > > > > > > >> > > > allocating/freeing > > > > > > >> > > > > > slot and reconciling the slot status > > > > > > >> > > > > > Maybe we can start a discussion about refactoring > > > > > SlotManager in > > > > > > >> > > > another > > > > > > >> > > > > > FLIP if there are some good suggestions. > > > > > > >> > > > > > WDYT > > > > > > >> > > > > > > > > > > > >> > > > > > 3. For me personally, having a more detailed summary > > > > > comparing > > > > > > >> the > > > > > > >> > > > > >> subcomponents of both SlotManager implementations > with > > > > > where > > > > > > >> > > > > >> their functionality matches and where they differ > might > > > > > help > > > > > > >> > > > understand > > > > > > >> > > > > the > > > > > > >> > > > > >> consequences of the changes proposed in FLIP-298 > > > > > > >> > > > > > > > > > > > >> > > > > > Good suggestion, I have updated the comparison in > this > > > > FLIP. > > > > > > >> > Looking > > > > > > >> > > > > > forward to any suggestions/thoughts > > > > > > >> > > > > > if they are not described clearly. > > > > > > >> > > > > > > > > > > > >> > > > > > *@John > > > > > > >> > > > > > > > > > > > >> > > > > > 4. In addition to changing the default, would it > make > > > > sense > > > > > to > > > > > > >> log > > > > > > >> > a > > > > > > >> > > > > >> deprecation warning on initialization > > > > > > >> > > > > > > > > > > > >> > > > > > if the DeclarativeSlotManager is used? > > > > > > >> > > > > >> > > > > > > >> > > > > > SGTM, We should add Deprecated annotations to DSM > for > > > > devs. > > > > > And > > > > > > >> > log a > > > > > > >> > > > > > deprecation warning for users. > > > > > > >> > > > > > > > > > > > >> > > > > > *@Shammon > > > > > > >> > > > > > > > > > > > >> > > > > > 1. For their functional differences, can you give > some > > > > > detailed > > > > > > >> > tests > > > > > > >> > > > to > > > > > > >> > > > > >> verify that the new FineGrainedSlotManager has > these > > > > > > >> capabilities? > > > > > > >> > > > This > > > > > > >> > > > > can > > > > > > >> > > > > >> effectively verify the new functions > > > > > > >> > > > > >> > > > > > > >> > > > > > As just maintained, there is already a CI stage of > FGSM, > > > > > and I > > > > > > >> will > > > > > > >> > > do > > > > > > >> > > > > more > > > > > > >> > > > > > review of unit tests for DSM. > > > > > > >> > > > > > > > > > > > >> > > > > > 2. I'm worried that many functions are not > independent > > > > and > > > > > it > > > > > > >> is > > > > > > >> > > > > difficult > > > > > > >> > > > > >> to migrate step-by-step. You can list the > relationship > > > > > between > > > > > > >> > them > > > > > > >> > > in > > > > > > >> > > > > >> detail. > > > > > > >> > > > > > > > > > > > >> > > > > > As Xintong saied the DSM is a subset of FGSM by > design. > > > > > But as > > > > > > >> > time > > > > > > >> > > > goes > > > > > > >> > > > > > on, FGSM has some lacking > > > > > > >> > > > > > functions as I listed in this FLIP. And I have > added the > > > > > > >> comparison > > > > > > >> > > > > between > > > > > > >> > > > > > DSM and FGSM in this FLIP. > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > Thanks again for all your thoughts. Any feedback is > > > > > appreciated! > > > > > > >> > > > > > > > > > > > >> > > > > > Best, > > > > > > >> > > > > > Weihua > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song < > > > > > > >> tonysong...@gmail.com > > > > > > >> > > > > > > > > >> > > > > wrote: > > > > > > >> > > > > > > > > > > > >> > > > > >> Thanks Weihua for preparing this FLIP. +1 for the > > > > proposal. > > > > > > >> > > > > >> > > > > > > >> > > > > >> > > > > > > >> > > > > >> As one of the contributors of the fine-grained slot > > > > > manager, > > > > > > >> I'd > > > > > > >> > > like > > > > > > >> > > > to > > > > > > >> > > > > >> share some backgrounds here. > > > > > > >> > > > > >> > > > > > > >> > > > > >> - There used to be a defaut slot manager > > > implementation, > > > > > which > > > > > > >> is > > > > > > >> > > > > >> non-declarative and has been removed now. The two > > > > features, > > > > > > >> > > > declarative > > > > > > >> > > > > / > > > > > > >> > > > > >> reactive resource management and fine-grained > resource > > > > > > >> management, > > > > > > >> > > > were > > > > > > >> > > > > >> proposed at about the same time. We were aware > that by > > > > > design > > > > > > >> the > > > > > > >> > > > > >> declarative slot manager is a subset of > fine-grained > > > slot > > > > > > >> manager > > > > > > >> > at > > > > > > >> > > > > that > > > > > > >> > > > > >> time, but still decided to implement two slot > managers > > > > for > > > > > the > > > > > > >> > > purpose > > > > > > >> > > > > of > > > > > > >> > > > > >> decoupling efforts and reducing cross-team > > > > synchronization > > > > > > >> > overhead. > > > > > > >> > > > > >> Merging the two slot managers once they are proved > > > stable > > > > > is > > > > > > >> IMO a > > > > > > >> > > > > >> technical debt that was planned at the very > beginning. > > > > > > >> > > > > >> > > > > > > >> > > > > >> - The FineGrainedSlotManager has been verified in > > > > Alibaba's > > > > > > >> > internal > > > > > > >> > > > > >> production as well as Alibaba Cloud services as the > > > > default > > > > > > >> slot > > > > > > >> > > > manager > > > > > > >> > > > > >> for about 2 years. > > > > > > >> > > > > >> > > > > > > >> > > > > >> > > > > > > >> > > > > >> Concerning test cases, we currently have a ci > stage for > > > > > fine > > > > > > >> > grained > > > > > > >> > > > > >> resource management. To avoid adding too much > burden, > > > the > > > > > stage > > > > > > >> > only > > > > > > >> > > > > >> includes tests from flink-runtime and flink-test > > > > modules. I > > > > > > >> think > > > > > > >> > > > > switching > > > > > > >> > > > > >> the default slot manager and applying the whole > set of > > > > > tests on > > > > > > >> > the > > > > > > >> > > > > >> fine-grained slot manager would help us to be more > > > > > confident > > > > > > >> about > > > > > > >> > > it. > > > > > > >> > > > > >> > > > > > > >> > > > > >> > > > > > > >> > > > > >> Concerning cutting out functionalities out of slot > > > > > manager, I > > > > > > >> > think > > > > > > >> > > > > Yangze > > > > > > >> > > > > >> and I have tried our best to shape the > > > > > FineGrainedSlotManager > > > > > > >> into > > > > > > >> > > > > >> reasonable components. I personally don't have > other > > > > ideas > > > > > to > > > > > > >> > > further > > > > > > >> > > > > >> disassemble the component, but I'm open to such > > > > > suggestions. > > > > > > >> > > However, > > > > > > >> > > > > from > > > > > > >> > > > > >> the stability perspective, I'd be in favor of not > > > > > introducing > > > > > > >> > > > > significant > > > > > > >> > > > > >> changes to the FineGrainedSlotManager while > switching > > > it > > > > > to the > > > > > > >> > > > default. > > > > > > >> > > > > >> Because the current implementation has already been > > > > > verified > > > > > > >> (or > > > > > > >> > at > > > > > > >> > > > > least > > > > > > >> > > > > >> partially verified because Alibaba does not cover > all > > > the > > > > > Flink > > > > > > >> > use > > > > > > >> > > > > cases), > > > > > > >> > > > > >> and introducing more changes also means more > chances of > > > > > > >> breaking > > > > > > >> > > > things. > > > > > > >> > > > > >> > > > > > > >> > > > > >> > > > > > > >> > > > > >> Best, > > > > > > >> > > > > >> > > > > > > >> > > > > >> Xintong > > > > > > >> > > > > >> > > > > > > >> > > > > >> > > > > > > >> > > > > >> > > > > > > >> > > > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY < > > > > > zjur...@gmail.com> > > > > > > >> > > wrote: > > > > > > >> > > > > >> > > > > > > >> > > > > >> > Hi > > > > > > >> > > > > >> > > > > > > > >> > > > > >> > Thanks for starting this work weihua, I think > > > unifying > > > > > > >> > > > > >> > DeclarativeSlotManager and > FineGrainedSlotManager is > > > > > > >> valuable. > > > > > > >> > > > > >> > > > > > > > >> > > > > >> > I agree with @Matthias and @John that we need a > way > > > to > > > > > ensure > > > > > > >> > that > > > > > > >> > > > > >> > DeclarativeSlotManager's capabilities are fully > > > covered > > > > > by > > > > > > >> > > > > >> > FineGrainedSlotManager > > > > > > >> > > > > >> > > > > > > > >> > > > > >> > 1. For their functional differences, can you give > > > some > > > > > > >> detailed > > > > > > >> > > > tests > > > > > > >> > > > > to > > > > > > >> > > > > >> > verify that the new FineGrainedSlotManager has > these > > > > > > >> > capabilities? > > > > > > >> > > > > This > > > > > > >> > > > > >> can > > > > > > >> > > > > >> > effectively verify the new functions > > > > > > >> > > > > >> > > > > > > > >> > > > > >> > 2. I'm worried that many functions are not > > > independent > > > > > and > > > > > > >> it is > > > > > > >> > > > > >> difficult > > > > > > >> > > > > >> > to migrate step-by-step. You can list the > > > relationship > > > > > > >> between > > > > > > >> > > them > > > > > > >> > > > in > > > > > > >> > > > > >> > detail. > > > > > > >> > > > > >> > > > > > > > >> > > > > >> > 3. As John mentioned, give a smoke test for > > > > > > >> > FineGrainedSlotManager > > > > > > >> > > > is > > > > > > >> > > > > a > > > > > > >> > > > > >> > good idea. Or you can add some test information > to > > > the > > > > > > >> > > > > >> > DeclarativeSlotManager to determine how many > tests > > > have > > > > > used > > > > > > >> it. > > > > > > >> > > In > > > > > > >> > > > > this > > > > > > >> > > > > >> > way, we can gradually construct test cases for > > > > > > >> > > > FineGrainedSlotManager > > > > > > >> > > > > >> > during the development process. > > > > > > >> > > > > >> > > > > > > > >> > > > > >> > > > > > > > >> > > > > >> > Best, > > > > > > >> > > > > >> > Shammon > > > > > > >> > > > > >> > > > > > > > >> > > > > >> > > > > > > > >> > > > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler < > > > > > > >> > > vvcep...@apache.org> > > > > > > >> > > > > >> wrote: > > > > > > >> > > > > >> > > > > > > > >> > > > > >> > > Thanks for the FLIP, Weihua! > > > > > > >> > > > > >> > > > > > > > > >> > > > > >> > > I’ve read the FLIP, and it sounds good to me. > We > > > need > > > > > to > > > > > > >> avoid > > > > > > >> > > > > >> > > proliferating alternative implementations > wherever > > > > > > >> possible. I > > > > > > >> > > > have > > > > > > >> > > > > >> just > > > > > > >> > > > > >> > a > > > > > > >> > > > > >> > > couple of comments: > > > > > > >> > > > > >> > > > > > > > > >> > > > > >> > > 1. I share Matthias’s concern about ensuring > the > > > > > behavior > > > > > > >> is > > > > > > >> > > > really > > > > > > >> > > > > the > > > > > > >> > > > > >> > > same. One suggestion I’ve used for this kind of > > > thing > > > > > is, > > > > > > >> as a > > > > > > >> > > > smoke > > > > > > >> > > > > >> > test, > > > > > > >> > > > > >> > > to update the DeclarativeSlotManager to just > > > delegate > > > > > to > > > > > > >> the > > > > > > >> > > > > >> > > FineGrainedSlotManager. If the full test suite > > > still > > > > > > >> passes, > > > > > > >> > you > > > > > > >> > > > > can be > > > > > > >> > > > > >> > > pretty sure the new default is really ok. It > would > > > > not > > > > > be a > > > > > > >> > good > > > > > > >> > > > > idea > > > > > > >> > > > > >> to > > > > > > >> > > > > >> > > actually keep that in for the release, since it > > > would > > > > > > >> remove > > > > > > >> > the > > > > > > >> > > > > option > > > > > > >> > > > > >> > to > > > > > > >> > > > > >> > > fall back in case of bugs. Either way, we need > to > > > > make > > > > > sure > > > > > > >> > all > > > > > > >> > > > test > > > > > > >> > > > > >> > > scenarios are present for the FGSM. > > > > > > >> > > > > >> > > > > > > > > >> > > > > >> > > 4. In addition to changing the default, would > it > > > make > > > > > > >> sense to > > > > > > >> > > > log a > > > > > > >> > > > > >> > > deprecation warning on initialization if the > > > > > > >> > > > DeclarativeSlotManager > > > > > > >> > > > > is > > > > > > >> > > > > >> > used? > > > > > > >> > > > > >> > > > > > > > > >> > > > > >> > > Thanks again, > > > > > > >> > > > > >> > > John > > > > > > >> > > > > >> > > > > > > > > >> > > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl > > > wrote: > > > > > > >> > > > > >> > > > Hi Weihua, > > > > > > >> > > > > >> > > > Thanks for your proposal. From a conceptual > > > point: > > > > > AFAIU, > > > > > > >> > the > > > > > > >> > > > > >> > > > DeclarativeSlotManager covers a subset (i.e. > only > > > > > evenly > > > > > > >> > sized > > > > > > >> > > > > slots) > > > > > > >> > > > > >> > of > > > > > > >> > > > > >> > > > what the FineGrainedSlotManager should be > able to > > > > > achieve > > > > > > >> > > > > (variable > > > > > > >> > > > > >> > slot > > > > > > >> > > > > >> > > > size per task manager). Is this the right > > > > > > >> > > > > assumption/understanding? > > > > > > >> > > > > >> In > > > > > > >> > > > > >> > > this > > > > > > >> > > > > >> > > > sense, merging both implementations into a > single > > > > one > > > > > > >> sounds > > > > > > >> > > > > good. A > > > > > > >> > > > > >> > few > > > > > > >> > > > > >> > > > more general comments, though: > > > > > > >> > > > > >> > > > > > > > > > >> > > > > >> > > > 1. Did you do a proper test coverage > analysis? > > > > > That's not > > > > > > >> > > > > mentioned > > > > > > >> > > > > >> in > > > > > > >> > > > > >> > > the > > > > > > >> > > > > >> > > > current version of the FLIP. I'm bringing > this up > > > > > > >> because we > > > > > > >> > > ran > > > > > > >> > > > > into > > > > > > >> > > > > >> > the > > > > > > >> > > > > >> > > > same issue when fixing the flaws that popped > up > > > > after > > > > > > >> > > > introducing > > > > > > >> > > > > the > > > > > > >> > > > > >> > > > multi-component leader election (see FLIP-285 > > > [1]). > > > > > There > > > > > > >> > is a > > > > > > >> > > > > risk > > > > > > >> > > > > >> > that > > > > > > >> > > > > >> > > by > > > > > > >> > > > > >> > > > removing the legacy code we decrease test > > > coverage > > > > > > >> because > > > > > > >> > > > certain > > > > > > >> > > > > >> > > > test cases that were covered for the legacy > > > classes > > > > > might > > > > > > >> > not > > > > > > >> > > be > > > > > > >> > > > > >> > > > necessarily covered in the new > implementation, > > > yet > > > > > (see > > > > > > >> > > > > FLINK-30338 > > > > > > >> > > > > >> [2] > > > > > > >> > > > > >> > > > which covers this issue for the leader > election > > > > > case). > > > > > > >> > > Ideally, > > > > > > >> > > > we > > > > > > >> > > > > >> > don't > > > > > > >> > > > > >> > > > want to remove test cases accidentally > because > > > they > > > > > were > > > > > > >> > only > > > > > > >> > > > > >> > implemented > > > > > > >> > > > > >> > > > for the DeclarativeSlotManager but missed > for the > > > > > > >> > > > > >> > FineGrainedSlotManager. > > > > > > >> > > > > >> > > > > > > > > > >> > > > > >> > > > 2. DeclarativeSlotManager and > > > > FineGrainedSlotManager > > > > > feel > > > > > > >> > > quite > > > > > > >> > > > > big > > > > > > >> > > > > >> in > > > > > > >> > > > > >> > > > terms of lines of code. Without knowing > whether > > > > it's > > > > > > >> > actually > > > > > > >> > > a > > > > > > >> > > > > >> > > reasonable > > > > > > >> > > > > >> > > > thing to do: Instead of just adding more > features > > > > to > > > > > the > > > > > > >> > > > > >> > > > FineGrainedSlotManager, have you thought of > > > cutting > > > > > out > > > > > > >> > > > > functionality > > > > > > >> > > > > >> > > into > > > > > > >> > > > > >> > > > smaller sub-components along this > refactoring? > > > > Such a > > > > > > >> > > > step-by-step > > > > > > >> > > > > >> > > approach > > > > > > >> > > > > >> > > > might improve the overall codebase and might > make > > > > > > >> reviewing > > > > > > >> > > the > > > > > > >> > > > > >> > > refactoring > > > > > > >> > > > > >> > > > easier. I did a first pass over the code and > > > > > struggled to > > > > > > >> > > > identify > > > > > > >> > > > > >> code > > > > > > >> > > > > >> > > > blocks that could be moved out of the > SlotManager > > > > > > >> > > > > implementation(s). > > > > > > >> > > > > >> > > > Therefore, I might be wrong with this > proposal. I > > > > > haven't > > > > > > >> > > worked > > > > > > >> > > > > on > > > > > > >> > > > > >> > this > > > > > > >> > > > > >> > > > codebase in that detail that it would allow > me to > > > > > come up > > > > > > >> > > with a > > > > > > >> > > > > >> > > judgement > > > > > > >> > > > > >> > > > call. I wanted to bring it up, anyway, > because > > > I'm > > > > > > >> curious > > > > > > >> > > > whether > > > > > > >> > > > > >> that > > > > > > >> > > > > >> > > > could be an option. There's a comment > created by > > > > > Chesnay > > > > > > >> > > (CC'd) > > > > > > >> > > > in > > > > > > >> > > > > >> the > > > > > > >> > > > > >> > > > JavaDoc of TaskExecutorManager [3] indicating > > > > > something > > > > > > >> > > similar. > > > > > > >> > > > > I'm > > > > > > >> > > > > >> > > > wondering whether he can add some insights > here. > > > > > > >> > > > > >> > > > > > > > > > >> > > > > >> > > > 3. For me personally, having a more detailed > > > > summary > > > > > > >> > comparing > > > > > > >> > > > the > > > > > > >> > > > > >> > > > subcomponents of both SlotManager > implementations > > > > > with > > > > > > >> where > > > > > > >> > > > > >> > > > their functionality matches and where they > differ > > > > > might > > > > > > >> help > > > > > > >> > > > > >> understand > > > > > > >> > > > > >> > > the > > > > > > >> > > > > >> > > > consequences of the changes proposed in > FLIP-298. > > > > > > >> > > > > >> > > > > > > > > > >> > > > > >> > > > Best, > > > > > > >> > > > > >> > > > Matthias > > > > > > >> > > > > >> > > > > > > > > > >> > > > > >> > > > [1] > > > > > > >> > > > > >> > > > > > > > > > >> > > > > >> > > > > > > > > >> > > > > >> > > > > > > > >> > > > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box > > > > > > >> > > > > >> > > > [2] > > > > > https://issues.apache.org/jira/browse/FLINK-30338 > > > > > > >> > > > > >> > > > [3] > > > > > > >> > > > > >> > > > > > > > > > >> > > > > >> > > > > > > > > >> > > > > >> > > > > > > > >> > > > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68 > > > > > > >> > > > > >> > > > > > > > > > >> > > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang < > > > > > > >> wang...@163.com> > > > > > > >> > > > > wrote: > > > > > > >> > > > > >> > > > > > > > > > >> > > > > >> > > >> This is a good proposal for me, it will > make the > > > > > code of > > > > > > >> > the > > > > > > >> > > > > >> > SlotManager > > > > > > >> > > > > >> > > >> more clear. > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> -- > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> Best, > > > > > > >> > > > > >> > > >> Matt Wang > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> ---- Replied Message ---- > > > > > > >> > > > > >> > > >> | From | David Morávek<d...@apache.org> | > > > > > > >> > > > > >> > > >> | Date | 02/27/2023 22:45 | > > > > > > >> > > > > >> > > >> | To | <dev@flink.apache.org> | > > > > > > >> > > > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: > Unifying the > > > > > > >> > > Implementation > > > > > > >> > > > > of > > > > > > >> > > > > >> > > >> SlotManager | > > > > > > >> > > > > >> > > >> Hi Weihua, I still need to dig into the > details, > > > > > but the > > > > > > >> > > > overall > > > > > > >> > > > > >> > > sentiment > > > > > > >> > > > > >> > > >> of this change sounds reasonable. > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> Best, > > > > > > >> > > > > >> > > >> D. > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao > Chen < > > > > > > >> > > > > >> > > zhanghao.c...@outlook.com> > > > > > > >> > > > > >> > > >> wrote: > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> Thanks for driving this topic. I think this > FLIP > > > > > could > > > > > > >> help > > > > > > >> > > > > clean up > > > > > > >> > > > > >> > the > > > > > > >> > > > > >> > > >> codebase to make it easier to maintain. +1 > on > > > it. > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> Best, > > > > > > >> > > > > >> > > >> Zhanghao Chen > > > > > > >> > > > > >> > > >> ________________________________ > > > > > > >> > > > > >> > > >> From: Weihua Hu <huweihua....@gmail.com> > > > > > > >> > > > > >> > > >> Sent: Monday, February 27, 2023 20:40 > > > > > > >> > > > > >> > > >> To: dev <dev@flink.apache.org> > > > > > > >> > > > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the > > > > > > >> Implementation of > > > > > > >> > > > > >> > SlotManager > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> Hi everyone, > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> I would like to begin a discussion on > FLIP-298: > > > > > Unifying > > > > > > >> > the > > > > > > >> > > > > >> > > Implementation > > > > > > >> > > > > >> > > >> of SlotManager[1]. There are currently two > types > > > > of > > > > > > >> > > SlotManager > > > > > > >> > > > > in > > > > > > >> > > > > >> > > Flink: > > > > > > >> > > > > >> > > >> DeclarativeSlotManager and > > > FineGrainedSlotManager. > > > > > > >> > > > > >> > > FineGrainedSlotManager > > > > > > >> > > > > >> > > >> should behave as DeclarativeSlotManager if > the > > > > user > > > > > does > > > > > > >> > not > > > > > > >> > > > > >> configure > > > > > > >> > > > > >> > > the > > > > > > >> > > > > >> > > >> slot request profile. > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> Therefore, this FLIP aims to unify the > > > > > implementation of > > > > > > >> > > > > SlotManager > > > > > > >> > > > > >> > in > > > > > > >> > > > > >> > > >> order to reduce maintenance costs. > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> Looking forward to hearing from you. > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> [1] > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > > > > > > > >> > > > > >> > > > > > > > >> > > > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> Best, > > > > > > >> > > > > >> > > >> Weihua > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > >> > > > > > > >> > > > > >> > > > > > > > > >> > > > > >> > > > > > > > >> > > > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > >