[ 
https://issues.apache.org/jira/browse/FLINK-30773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17679895#comment-17679895
 ] 

Pedro Cardoso Silva commented on FLINK-30773:
---------------------------------------------

For other ideas on efficient rescaling of Flink jobs that could be explored 
in the future, see this paper:
[https://www.usenix.org/system/files/atc22-gu-rong.pdf]

Code implementation: [https://github.com/ATC2022No63/Meces]
Patch with the differences for easier study: [^meces.patch]

> Allow rescaling of jobs based on per-vertex parallelism overrides
> -----------------------------------------------------------------
>
>                 Key: FLINK-30773
>                 URL: https://issues.apache.org/jira/browse/FLINK-30773
>             Project: Flink
>          Issue Type: New Feature
>          Components: Autoscaler, Runtime / Coordination, Runtime / REST
>            Reporter: Maximilian Michels
>            Assignee: Maximilian Michels
>            Priority: Major
>         Attachments: meces.patch
>
>
> FLINK-29501 introduced a way to rescale jobs via a user-provided parallelism 
> overrides map. This feature is already used today by the Autoscaler of the 
> Flink Kubernetes operator. However, it requires a full restart of the Flink 
> job and only supports the application deployment mode.
> In a K8s environment, this is inefficient because all pods for a deployment 
> will be surrendered. Upon restart, they have to be re-acquired. In addition 
> to being slow, this can also lead to situations where resource constraints 
> prevent a restart from executing properly.
> Ideally, we would want the following to happen on receiving a rescale 
> request:
>  # Rescale API receives a request with a parallelism overrides map (vertexId 
> => parallelism) for a jobId
>  # Compute the number of required task slots using the overrides and the 
> current JobGraph
>  ## If the total number of task slots for the cluster is less than the 
> required number of task slots of the rescale, acquire the missing task slots. 
> Otherwise, do nothing
>  ## Wait for new task slots to become available
>  ## Abort rescale request on timeout
>  # Redeploy the JobGraph / Tasks with the updated parallelisms
>  # Surrender any unused task slots in case of scaling down
>  
> There is an existing "Rescaling" API which is currently disabled. We have to 
> evaluate whether reusing it makes sense.
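
The slot accounting in the steps above can be sketched as follows. This is a
minimal illustration, not Flink code: the function names (required_slots,
plan_rescale, wait_for_slots) and the RescaleAborted exception are
hypothetical, and the sketch assumes all vertices share a single slot sharing
group, so the number of required task slots equals the highest vertex
parallelism. A job with several slot sharing groups would need a per-group
calculation instead.

```python
import time
from typing import Callable, Dict


class RescaleAborted(Exception):
    """Raised when the missing task slots do not arrive before the timeout."""


def required_slots(current: Dict[str, int], overrides: Dict[str, int]) -> int:
    # Apply the overrides map (vertexId => parallelism) on top of the
    # parallelisms taken from the current JobGraph.
    merged = {**current, **overrides}
    # Assumption: one slot sharing group, so one slot can host one subtask
    # of every vertex and the maximum parallelism determines the slot count.
    return max(merged.values(), default=0)


def plan_rescale(
    current: Dict[str, int], overrides: Dict[str, int], total_slots: int
) -> Dict[str, int]:
    # Reject overrides for vertices that are not part of the job.
    unknown = set(overrides) - set(current)
    if unknown:
        raise KeyError(f"unknown vertex ids: {sorted(unknown)}")
    needed = required_slots(current, overrides)
    return {
        "required": needed,
        # Scaling up: slots that have to be acquired before redeploying.
        "to_acquire": max(0, needed - total_slots),
        # Scaling down: slots that can be surrendered after redeploying.
        "to_release": max(0, total_slots - needed),
    }


def wait_for_slots(
    available_slots: Callable[[], int],
    required: int,
    timeout_s: float,
    poll_s: float = 0.1,
) -> None:
    # Poll until enough slots are available; abort the rescale on timeout.
    deadline = time.monotonic() + timeout_s
    while available_slots() < required:
        if time.monotonic() >= deadline:
            raise RescaleAborted(f"timed out waiting for {required} slots")
        time.sleep(poll_s)
```

For example, with current parallelisms {"source": 2, "map": 4, "sink": 2}, an
overrides map {"map": 8}, and 4 task slots in the cluster, plan_rescale
reports 8 required slots, 4 of which have to be acquired before the redeploy.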



--
This message was sent by Atlassian Jira
(v8.20.10#820010)