huntercc created FLINK-23905:
--------------------------------

             Summary: Reduce the load on JobManager when submitting large-scale 
job with a big user jar
                 Key: FLINK-23905
                 URL: https://issues.apache.org/jira/browse/FLINK-23905
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Task
            Reporter: huntercc


As described in FLINK-20612 and FLINK-21731, there are some time-consuming 
steps in the job startup phase. Recently, we found that when submitting a 
large-scale job with a large user jar, the time tasks spend moving from the 
DEPLOYING state to the RUNNING state accounts for a high proportion of the 
total startup time.

In the task initialization stage, the user jar needs to be pulled from the 
JobManager through the BlobService. The JobManager has to spend a lot of its 
resources distributing the file, which puts it under heavy load during the 
start-up stage. Under this load, the JobManager can fail to respond in time to 
RPC requests sent by the TaskManagers, causing timeout exceptions such as Akka 
timeout exceptions. These lead to job restarts and further prolong the start-up 
time of the job.
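
To give a feel for the scale, here is a rough back-of-the-envelope sketch. It 
assumes each TaskManager fetches the user jar from the JobManager once, and 
that the JobManager's outbound bandwidth is the only limit. The function name 
and all of the numbers are hypothetical and are not measurements from our 
cluster.

```python
def jobmanager_blob_load(num_task_managers, jar_size_mb, jm_egress_mb_per_s):
    """Estimate the data the JobManager serves and a lower bound on the time.

    Assumption: one jar download per TaskManager, all served by the JobManager.
    """
    # Total volume grows linearly with both the jar size and the cluster size.
    total_mb = num_task_managers * jar_size_mb
    # Best case: the JobManager saturates its outbound link the whole time.
    min_seconds = total_mb / jm_egress_mb_per_s
    return total_mb, min_seconds


# Hypothetical job: 2000 TaskManagers, 500 MB user jar, 1000 MB/s egress.
total_mb, min_seconds = jobmanager_blob_load(2000, 500, 1000)
print(f"JobManager serves about {total_mb} MB, taking at least {min_seconds:.0f} s")
```

Even in this idealized case the JobManager spends a long time doing nothing but 
serving the jar, and that is the same window in which it has to answer RPC 
requests from the TaskManagers.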



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
