[ https://issues.apache.org/jira/browse/FLINK-30773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17679895#comment-17679895 ]
Pedro Cardoso Silva commented on FLINK-30773:
---------------------------------------------

For other ideas on efficient scaling of Flink jobs that could be looked into in the future:
[https://www.usenix.org/system/files/atc22-gu-rong.pdf]

Code implementation: [https://github.com/ATC2022No63/Meces]

Patch with the differences for easier study: [^meces.patch]

> Allow rescaling of jobs based on per-vertex parallelism overrides
> -----------------------------------------------------------------
>
>                 Key: FLINK-30773
>                 URL: https://issues.apache.org/jira/browse/FLINK-30773
>             Project: Flink
>          Issue Type: New Feature
>          Components: Autoscaler, Runtime / Coordination, Runtime / REST
>            Reporter: Maximilian Michels
>            Assignee: Maximilian Michels
>            Priority: Major
>         Attachments: meces.patch
>
>
> FLINK-29501 introduced a way to rescale jobs via a user-provided parallelism
> overrides map. This feature is already used today by the Autoscaler of the
> Flink Kubernetes operator. However, it requires a full restart of the Flink
> job and only supports the application deployment mode.
> In a K8s environment, this is inefficient because all pods for a deployment
> will be surrendered. Upon restart, they have to be re-acquired. In addition
> to being slow, this can also lead to situations where resource constraints
> prevent a restart from executing properly.
> Ideally, we would want the following to happen on receiving a rescale
> request:
> # Rescale API receives a request with a parallelism overrides map (vertexId
> => parallelism) for a jobId
> # Compute the number of required task slots using the overrides and the
> current JobGraph
> ## If the total number of task slots for the cluster is less than the
> required number of task slots of the rescale, acquire the missing task slots.
> Otherwise, do nothing
> ## Wait for new task slots to become available
> ## Abort rescale request on timeout
> # Redeploy the JobGraph / Tasks with the updated parallelisms
> # Surrender any unused task slots in case of scaling down
>
> There is an existing "Rescaling" API which is currently disabled. We have to
> evaluate whether reusing it makes sense.
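
The rescale steps listed in the issue description can be sketched roughly as below. This is only an illustration of the proposed control flow, not Flink code: `required_slots`, `rescale`, and `InMemoryCluster` are hypothetical names, the cluster object stands in for whatever component would acquire and release task slots, and the slot count assumes all vertices sit in a single slot sharing group (so the job needs as many slots as its most parallel vertex).

```python
import time
from typing import Dict


def required_slots(current: Dict[str, int], overrides: Dict[str, int]) -> int:
    """Slots needed once the overrides are applied.

    Assumption: one slot sharing group for the whole job, so the required
    slot count equals the highest vertex parallelism.
    """
    unknown = set(overrides) - set(current)
    if unknown:
        raise ValueError(f"unknown vertex ids: {sorted(unknown)}")
    return max({**current, **overrides}.values(), default=0)


class InMemoryCluster:
    """Hypothetical stand-in for slot management; grants requests instantly."""

    def __init__(self, slots: int):
        self.slots = slots
        self.deployed: Dict[str, int] = {}

    def total_slots(self) -> int:
        return self.slots

    def request_slots(self, count: int) -> None:
        self.slots += count

    def redeploy(self, parallelisms: Dict[str, int]) -> None:
        self.deployed = dict(parallelisms)

    def release_slots(self, count: int) -> None:
        self.slots -= count


def rescale(cluster, current: Dict[str, int], overrides: Dict[str, int],
            timeout_s: float = 30.0, poll_s: float = 0.1) -> Dict[str, int]:
    """Follow the proposed steps; raise TimeoutError if slots never arrive."""
    # Step 2: compute the required slots from overrides + current parallelisms.
    needed = required_slots(current, overrides)
    missing = needed - cluster.total_slots()
    if missing > 0:
        # Step 2.1: acquire only the missing slots.
        cluster.request_slots(missing)
        deadline = time.monotonic() + timeout_s
        # Step 2.2: wait for the new slots to become available.
        while cluster.total_slots() < needed:
            if time.monotonic() >= deadline:
                # Step 2.3: abort the rescale request on timeout.
                raise TimeoutError("rescale aborted: slots not available")
            time.sleep(poll_s)
    # Step 3: redeploy with the updated parallelisms.
    updated = {**current, **overrides}
    cluster.redeploy(updated)
    # Step 4: surrender any slots the rescaled job no longer uses.
    surplus = cluster.total_slots() - needed
    if surplus > 0:
        cluster.release_slots(surplus)
    return updated
```

For example, `rescale(InMemoryCluster(2), {"a": 2, "b": 2}, {"b": 4})` would request two additional slots before redeploying, while scaling `{"a": 8}` down to `{"a": 3}` on an eight-slot cluster would redeploy first and then release five slots.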