[ https://issues.apache.org/jira/browse/FLINK-20612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhu Zhu updated FLINK-20612:
----------------------------
    Fix Version/s: 1.13.0

> Add benchmarks for scheduler
> ----------------------------
>
>                 Key: FLINK-20612
>                 URL: https://issues.apache.org/jira/browse/FLINK-20612
>             Project: Flink
>          Issue Type: Improvement
>          Components: Benchmarks
>    Affects Versions: 1.13.0
>            Reporter: Zhilong Hong
>            Assignee: Zhilong Hong
>            Priority: Major
>             Fix For: 1.13.0
>
>
> With Flink 1.12, we failed to run large-scale jobs on our cluster. When we
> tried to run the jobs, we hit exceptions such as heap out-of-memory errors
> and TaskManager heartbeat timeouts. We increased the heap size and extended
> the heartbeat timeout, but the jobs still failed. After troubleshooting, we
> found several performance bottlenecks in the JobMaster. These bottlenecks
> are closely related to the complexity of the topology.
> We implemented several benchmarks for these bottlenecks based on
> flink-benchmark. The topology of the benchmarks is a simple graph that
> consists of only two vertices: one source vertex and one sink vertex. The
> two vertices are connected by all-to-all blocking edges, and each vertex has
> a parallelism of 8000. The execution mode is batch. The results of the
> benchmarks are shown below:
> Table 1: Benchmark results for the bottlenecks in the JobMaster
> |*Procedure*|*Time spent*|
> |Build topology|19970.44 ms|
> |Init scheduling strategy|38167.351 ms|
> |Deploy tasks|15102.850 ms|
> |Calculate failover region to restart|12080.271 ms|
> We'd like to propose these benchmarks for the procedures related to the
> scheduler. There are three main benefits:
> # They help us understand the current state of task deployment performance
> and locate the bottlenecks.
> # We can use the benchmarks to evaluate future optimizations.
> # Because we run the benchmarks daily, they help us track how performance
> changes over time and locate the commit that introduces a performance
> regression, if there is one.
> In the first version of the benchmarks, we mainly focus on the procedures
> mentioned above. The methods corresponding to the procedures are:
> # Building topology: {{ExecutionGraph#attachJobGraph}}
> # Initializing scheduling strategies:
> {{PipelinedRegionSchedulingStrategy#init}}
> # Deploying tasks: {{Execution#deploy}}
> # Calculating failover regions:
> {{RestartPipelinedRegionFailoverStrategy#getTasksNeedingRestart}}
> In the benchmarks, the topology consists of two vertices: source -> sink.
> They are connected by all-to-all edges. The two result partition types
> ({{PIPELINED}} and {{BLOCKING}}) should be benchmarked separately.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
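
To put the benchmark topology from the description in perspective, here is a minimal sketch of the arithmetic behind it. With two vertices of parallelism 8000 connected all-to-all, there are 16,000 subtasks and 64,000,000 subtask-to-subtask connections. The class and method names below are hypothetical and are not part of Flink or flink-benchmark. The issue does not state how each procedure scales, so treating the edge count as the relevant size is an assumption.

```java
// Hypothetical helper, not part of Flink: sizes the two-vertex all-to-all
// topology described in the issue (source -> sink, parallelism 8000 each).
public class AllToAllTopologySketch {

    /** Connections when every source subtask feeds every sink subtask. */
    static long allToAllEdgeCount(int sourceParallelism, int sinkParallelism) {
        // Widen to long before multiplying to avoid int overflow at large scale.
        return (long) sourceParallelism * sinkParallelism;
    }

    /** Total number of subtasks across the two vertices. */
    static long subtaskCount(int sourceParallelism, int sinkParallelism) {
        return (long) sourceParallelism + sinkParallelism;
    }

    public static void main(String[] args) {
        int parallelism = 8000; // value taken from the issue description
        System.out.println("subtasks: " + subtaskCount(parallelism, parallelism));
        System.out.println("edges:    " + allToAllEdgeCount(parallelism, parallelism));
    }
}
```

Running `main` prints 16000 subtasks and 64000000 edges. Any procedure that touches each connection once would do work proportional to the second number, which may be why the timings in Table 1 are so sensitive to topology complexity.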