Re: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

Zheng Yu Chen Mon, 29 Aug 2022 00:00:35 -0700

Thanks to the community for the feedback and suggestions. The problem I
want to solve is the heavy workload of the JobManager (as the title says,
the focus is on how to reduce that workload). My first idea, described in
the FLIP, was to reduce the JobManager's resource overhead by migrating its
largest component, the JobMaster, to a new component and sharing that work
with other worker components, so that the JobManager uses fewer resources.
On reflection, I found that the same goal can be achieved by scaling the
JobManager horizontally, rather than adding a new component that makes the
overall coordination layer more complex. This idea may also have a simpler
implementation (described later in this mail).


Here I would like to share how we use Session Cluster and the problems we
see in our production environment:

    * JobManager OOM: when the workload hits a sudden traffic peak, the JM
cannot be scaled out temporarily, so the pressure becomes excessive and it
runs out of memory
    * Job recovery takes too long: after an OOM, restarting the JobManager
and redeploying the jobs takes too long

Although the community advises deploying jobs in Application Mode, I still
prefer Session Cluster for some small jobs (such as short-lived batch jobs,
simple streaming jobs that use only 1-3 slots for a long time, FlinkSQL
debugging jobs, etc.) because, as *@David Morávek* said, our resources do
not need to be initialized.

After a few days of thinking, here is my reply to the concerns raised
earlier by *@David Morávek @Matthias Pohl @Chesnay Schepler @Xintong Song*:

* The new coordination layer adds a certain amount of complexity
* It can solve the problem of high JobManager load, but it is not the best
solution

Given the above, the current FLIP design may not be the optimal solution.

After reading your comments carefully, I now have a new idea to share 💡

As *@Matthias Pohl* and *@Xintong Song* said, we can consider transferring
part of the JobMaster workload to other standby JobManagers. The benefits of
doing this are as follows:

* Like TaskManagers, JobManagers can be scaled horizontally; when the
cluster runs a large number of jobs, this can effectively reduce the
frequency of JVM full GCs (FGC)
* Faster job recovery: some JobMasters are moved to other candidate
JobManagers, so a single JobManager no longer has to recover all jobs
* Compared with the previous scheme, the complexity is lower, and most of
the current code can be reused instead of being broken up or extended with
a new coordination layer
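
To make the idea above more concrete, here is a minimal sketch in Java. It
is only an illustration under assumptions: the class
`JobMasterPlacementSketch`, its methods, and the `jobmanager-N` ids are
hypothetical and are not part of Flink or of the FLIP, and the "fewest jobs
first" placement rule is just one possible policy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Flink API: spreads JobMasters over several JobManagers.
final class JobMasterPlacementSketch {

    // JobManager id -> ids of the jobs whose JobMaster it hosts (sorted by id for determinism).
    private final Map<String, List<String>> jobsByJobManager = new TreeMap<>();

    JobMasterPlacementSketch(int numJobManagers) {
        if (numJobManagers < 1) {
            throw new IllegalArgumentException("need at least one JobManager");
        }
        for (int i = 0; i < numJobManagers; i++) {
            jobsByJobManager.put("jobmanager-" + i, new ArrayList<>());
        }
    }

    // Assumed policy: place the new JobMaster on the JobManager hosting the fewest jobs.
    String assign(String jobId) {
        String target = null;
        int fewest = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> entry : jobsByJobManager.entrySet()) {
            if (entry.getValue().size() < fewest) {
                fewest = entry.getValue().size();
                target = entry.getKey();
            }
        }
        jobsByJobManager.get(target).add(jobId);
        return target;
    }

    // Jobs that a single JobManager has to recover after it restarts.
    List<String> jobsToRecover(String jobManagerId) {
        return new ArrayList<>(jobsByJobManager.getOrDefault(jobManagerId, new ArrayList<>()));
    }

    public static void main(String[] args) {
        JobMasterPlacementSketch placement = new JobMasterPlacementSketch(3);
        for (int i = 0; i < 6; i++) {
            System.out.println("job-" + i + " -> " + placement.assign("job-" + i));
        }
        System.out.println("jobmanager-0 recovers " + placement.jobsToRecover("jobmanager-0"));
    }
}
```

With three JobManagers and six jobs, each JobManager in this sketch ends up
hosting two JobMasters, so a restart of one of them only has to recover its
own share of the jobs.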

Users only need to turn on a switch for this feature and configure the
number of JobManagers to benefit from horizontal scaling of the JobManager.

If the community thinks this solution is feasible, I will rewrite my FLIP
and write up my specific ideas.

Looking forward to your suggestions~

Zheng Yu Chen <jam.gz...@gmail.com> wrote on Tue, Aug 16, 2022 at 17:40:

> Hi community ~
>
> I think this title should be quite interesting. The idea is to reduce the
> workload of the JobManager and make the SessionCluster [2] more stable in
> the process of running jobs. I designed a plan for splitting the JobManager
> on FLIP-257 [1]:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process>
>
> This proposal describes a scheme for splitting the current process and a
> new process model that stays compatible with the original one: the
> internal JobMaster component is split out of the JobManager, and a
> parameter controls whether the new process model is enabled. In the split
> scheme, when the user enables it, the JobMaster runs as an independent
> service, reducing the workload of the JobManager. A new Dispatcher
> communicates with one or more split-out JobMasters to manage jobs.
>
> The main features it provides are:
>
>    - After the user submits a job, the JobMaster threads that used to run
>    in the JobManager are split out: they no longer run in the JobManager,
>    but in other processes.
>    - Users can deploy multiple instances of the component mentioned above,
>    each running multiple JobMaster threads, thereby reducing the workload
>    of the JobManager
>
> Some of the challenging use cases that these features solve are:
>
>    - Compatible with the original job running mode (run JobMaster Thread
>    on JobManager)
>    - Implement a new Dispatcher that forwards client operations related
>    to jobs
>
>
> I would love to hear and address your thoughts and feedback and, if
> possible, drive FLIP-257 forward!
>
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process>
>
> [2]
> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode
>
>
> --
>
> Have a nice day ~
>
> ConradJam
>


-- 
Best

ConradJam
