Thanks to the community for the feedback and suggestions. The problem I actually want to solve is the current workload of the JobManager (as the title says, the focus is on how to reduce the JobManager's workload). My first idea, described in the FLIP, was to reduce the JobManager's resource overhead: its largest component, the JobMaster, would migrate to a new component so that this part of the work is shared with other components and the JobManager's resource usage goes down. After thinking about it further, I found that the same goal can be achieved by scaling the JobManager horizontally, rather than adding a new component that increases the complexity of the overall coordination layer. This idea may also have a simpler implementation (described below).
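To make the horizontal-expansion idea more concrete, here is a minimal sketch in Java. It is not Flink code: `JobMasterPlacement`, `assign`, `failOver`, and `load` are hypothetical names, and the least-loaded placement policy is only an assumption for illustration. It shows JobMasters being spread over several JobManagers, and only the jobs of a failed JobManager being recovered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; none of these names exist in Flink. */
class JobMasterPlacement {
    // JobManager id -> ids of the jobs whose JobMaster runs on that JobManager.
    private final TreeMap<Integer, List<String>> jobsByJobManager = new TreeMap<>();

    JobMasterPlacement(int numJobManagers) {
        if (numJobManagers < 1) {
            throw new IllegalArgumentException("need at least one JobManager");
        }
        for (int id = 0; id < numJobManagers; id++) {
            jobsByJobManager.put(id, new ArrayList<>());
        }
    }

    /** Places the JobMaster of the given job on the least-loaded JobManager. */
    int assign(String jobId) {
        int target = leastLoaded();
        jobsByJobManager.get(target).add(jobId);
        return target;
    }

    /**
     * Simulates the loss of one JobManager: only its jobs are moved to the
     * remaining JobManagers. Returns the jobs that had to be recovered.
     */
    List<String> failOver(int failedJobManager) {
        if (!jobsByJobManager.containsKey(failedJobManager)) {
            throw new IllegalArgumentException("unknown JobManager " + failedJobManager);
        }
        if (jobsByJobManager.size() == 1) {
            throw new IllegalStateException("no other JobManager left to take over");
        }
        List<String> orphaned = jobsByJobManager.remove(failedJobManager);
        for (String jobId : orphaned) {
            assign(jobId);
        }
        return orphaned;
    }

    /** Number of JobMasters currently hosted by the given JobManager. */
    int load(int jobManagerId) {
        List<String> jobs = jobsByJobManager.get(jobManagerId);
        return jobs == null ? 0 : jobs.size();
    }

    // Ties are broken in favor of the lowest JobManager id (ascending iteration).
    private int leastLoaded() {
        int best = -1;
        int bestLoad = Integer.MAX_VALUE;
        for (Map.Entry<Integer, List<String>> entry : jobsByJobManager.entrySet()) {
            if (entry.getValue().size() < bestLoad) {
                best = entry.getKey();
                bestLoad = entry.getValue().size();
            }
        }
        return best;
    }
}
```

In this sketch, three JobManagers with six submitted jobs end up hosting two JobMasters each. If one JobManager fails, only its two jobs are recovered, and the other four keep running untouched.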

Here I would like to share how we use Session Clusters and the problems we hit in production:

* JobManager OOM: the workload sometimes faces sudden traffic peaks, and the JM cannot be temporarily scaled out, which results in excessive pressure and OOM.
* Job recovery takes too long: after an OOM, restarting the JobManager and redeploying the jobs takes too long.

Although the community advises deploying jobs in Application Mode, for some small jobs (such as short-lived batch jobs, simple streaming jobs that use only 1-3 slots for a long time, FlinkSQL debugging jobs, etc.) I still prefer to use a Session Cluster because, as *@David Morávek* said, our resources do not need to be initialized.

After a few days of thinking, here is my reply to the concerns raised earlier by *@David Morávek @Matthias Pohl @Chesnay Schepler @Xintong Song*:

* The new coordination layer brings a certain amount of complexity.
* It can solve the problem of high JobManager load, but it is not the best solution.

Given the above, the current FLIP design may not be the optimal solution.

After reading your comments carefully, I now have a new idea to share 💡

As *@Matthias Pohl* and *@Xintong Song* said, we can consider transferring part of the JobMaster workload to other standby JobManagers. The benefits of doing this are as follows:

* Like TaskManagers, JobManagers can be scaled horizontally; when a cluster runs a large number of jobs, this can effectively relieve JVM full GC (FGC) pressure.
* Faster job recovery: some JobMasters are moved to other candidate JobManagers, so when recovering jobs we no longer need a single JobManager to recover all of them.
* Compared with the previous scheme, the complexity is reduced, and most of the current code can be reused instead of breaking it or adding a new coordination layer.

Users only need to enable this feature with a switch and configure the number of JobManagers to benefit from horizontal scaling of the JobManager.

If the community thinks this solution is feasible, I will rewrite my FLIP and organize some of my specific ideas.

Looking forward to your suggestions~

On Tue, Aug 16, 2022 at 17:40, Zheng Yu Chen <jam.gz...@gmail.com> wrote:

> Hi community ~
>
> I think this title should be quite interesting. The idea is to reduce the
> workload of the JobManager and make the SessionCluster [2] more stable in
> the process of running jobs. I designed a plan for splitting the JobManager
> on FLIP-257 [1]:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process>
>
> This proposal describes a splitting scheme for the current process and a
> new process implementation idea that is compatible with the original
> process model: splitting out the internal JobMaster component of the
> JobManager, and controlling whether to enable this new process through a
> parameter. In the split scheme, when the user enables it, the JobMaster
> runs as an independent service, reducing the workload of the
> JobManager. A new Dispatcher communicates and interacts
> with a single split JobMaster or multiple JobMasters to achieve job
> management.
>
> The main features that it provides are:
>
> - After the user submits the job, the JobMaster threads, which used to
>   run in the JobManager, are split out. They no longer run in the
>   JobManager, but in other processes.
> - Users can deploy multiple components mentioned above, which run
>   multiple JobMaster threads, thereby reducing the workload of the JobManager
>
> Some of the challenging use cases that these features solve are:
>
> - Compatible with the original job running mode (run JobMaster Thread
>   on JobManager)
> - Implement a new Dispatcher that forwards client operations related
>   to jobs
>
>
> I would love to hear and address your thoughts and feedback, and if
> possible drive a FLIP-257 !
>
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process>
>
> [2]
> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode
>
>
> --
>
> Have a nice day ~
>
> ConradJam
>

--
Best

ConradJam