xiangyu feng created FLINK-33683:
------------------------------------

             Summary: Improve the performance of submitting jobs and fetching results to a running flink cluster
                 Key: FLINK-33683
                 URL: https://issues.apache.org/jira/browse/FLINK-33683
             Project: Flink
          Issue Type: Improvement
          Components: Client / Job Submission, Table SQL / Client
            Reporter: xiangyu feng

Submitting jobs to a long-running Flink cluster and fetching their results currently involves a lot of unnecessary overhead. This is acceptable for streaming and batch jobs, because in those scenarios users do not frequently submit jobs to, or fetch results from, a running cluster. In OLAP scenarios, however, users continuously submit large numbers of short-lived jobs to the running cluster, and this overhead has a huge impact on end-to-end (E2E) performance.

Here are some examples of unnecessary overhead:

* Each `RemoteExecutor` creates a new `StandaloneClusterDescriptor` when executing a job, even when the job targets the same remote cluster
* `StandaloneClusterDescriptor` always creates a new `RestClusterClient` when retrieving an existing Flink cluster
* Each `RestClusterClient` creates a new `ClientHighAvailabilityServices`, which might contain a resource-consuming HA client (ZKClient or KubeClient) and perform a time-consuming leader retrieval operation
* `RestClient` creates a new connection for every request, which costs extra connection establishment time

Therefore, I suggest creating this ticket and the following subtasks to improve performance. This ticket is also related to FLINK-25318.
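
To illustrate the kind of reuse this ticket is asking for, here is a minimal sketch of caching one expensive client per cluster address instead of rebuilding it for every job. It is not Flink code: `ClusterClientCache`, `FakeClient`, and the address strings are hypothetical names used only for illustration, and the actual subtasks may take a different approach.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical cache that hands out one shared client per cluster address. */
final class ClusterClientCache<C extends AutoCloseable> implements AutoCloseable {
    private final Map<String, C> clients = new ConcurrentHashMap<>();
    private final Function<String, C> factory;

    ClusterClientCache(Function<String, C> factory) {
        this.factory = factory;
    }

    /** Returns the cached client for the address, creating it on first use only. */
    C getOrCreate(String clusterAddress) {
        return clients.computeIfAbsent(clusterAddress, factory);
    }

    /** Closes every cached client and empties the cache. */
    @Override
    public void close() throws Exception {
        for (C client : clients.values()) {
            client.close();
        }
        clients.clear();
    }
}

/** Stand-in for an expensive client (assumption); counts how many instances were built. */
final class FakeClient implements AutoCloseable {
    static final AtomicInteger CREATED = new AtomicInteger();
    final String address;

    FakeClient(String address) {
        this.address = address;
        CREATED.incrementAndGet();
    }

    @Override
    public void close() {
        // Nothing to release in this sketch.
    }
}

public class ClientReuseSketch {
    public static void main(String[] args) throws Exception {
        try (ClusterClientCache<FakeClient> cache = new ClusterClientCache<>(FakeClient::new)) {
            // Simulate many short-lived job submissions against the same cluster.
            for (int i = 0; i < 1000; i++) {
                cache.getOrCreate("localhost:8081");
            }
            System.out.println("clients created: " + FakeClient.CREATED.get());
        }
    }
}
```

In this sketch a thousand submissions to the same address construct the client once. A real implementation would also have to handle leader changes, client failures, and eviction of idle entries, which are left out here.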