[ https://issues.apache.org/jira/browse/FLINK-37411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932585#comment-17932585 ]
Rui Fan commented on FLINK-37411:
---------------------------------

{quote}Let me give you an example. What happens if the user itself introduces a breaking change at any point in time? (independent of the autoscaler or parallelism settings){quote}

Thanks for the clarification! I understand your concerns now. Let me look into the operator-related logic first.

> Introduce the rollback mechanism for Autoscaler
> ------------------------------------------------
>
>                 Key: FLINK-37411
>                 URL: https://issues.apache.org/jira/browse/FLINK-37411
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>             Fix For: kubernetes-operator-1.12.0
>
>
> h1. Background & Motivation
> In some cases, a job becomes unhealthy (cannot run normally) after it is scaled by the autoscaler.
> One option is to roll the job back when it cannot run normally after scaling.
> h1. Examples (Which scenarios need a rollback mechanism?)
> h2. Example 1: The network memory is insufficient after scaling up
> Flink tasks request more network memory after scaling up. The Flink job cannot be started (it fails over infinitely) if the network memory is insufficient.
> The job may have had lag before scaling up, but it cannot run at all after scaling. We have two solutions for this case:
> * Autotuning is enabled: increase the TM network memory and restart the Flink cluster.
> * Autotuning is disabled (in-place rescaling): failing over (retrying) infinitely is useless; it's better to roll the job back to the last parallelisms or the first parallelisms.
> h2. Example 2: GC pressure or heap usage is high
> Currently, autoscaling is paused if the GC pressure or the heap usage exceeds the corresponding threshold. (See the job.autoscaler.memory.gc-pressure.threshold and job.autoscaler.memory.heap-usage.threshold options for more details.)
> This case might happen after scaling down. There are two solutions as well:
> * Autotuning is enabled: increase the TM heap memory. (The TM total memory may also need to be increased; currently, Autotuning never increases the TM total memory, it only decreases it.)
> * Autotuning is disabled (in-place rescaling): roll the job back to the last parallelisms or the first parallelisms.
> h1. Proposed change
> Note: Autotuning could be integrated with these examples in the future.
> This Jira introduces the JobUnrecoverableErrorChecker plugin interface, and we could define two built-in checkers in the first version (for Example 1 and Example 2); a sketch of one possible checker is shown after this issue description.
> {code:java}
> /**
>  * Check whether the job encountered an unrecoverable error.
>  *
>  * @param <KEY> The job key.
>  * @param <Context> Instance of JobAutoScalerContext.
>  */
> @Experimental
> public interface JobUnrecoverableErrorChecker<KEY, Context extends JobAutoScalerContext<KEY>> {
>
>     /**
>      * @return True means the job encountered an unrecoverable error and the scaling will be
>      *     rolled back. Otherwise, the job ran normally or encountered a recoverable error.
>      */
>     boolean check(Context context, EvaluatedMetrics evaluatedMetrics);
> } {code}
> The job is rolled back when any checker returns true, and scaling is paused until the cluster is restarted.
> h2. What needs to be discussed
> Should the job be rolled back to the parallelism initially set by the user, or to the last parallelism before scaling?
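
Below is a minimal, hedged sketch of how one of the proposed built-in checkers (the GC-pressure / heap-usage case from Example 2) could implement the interface above. It is only an illustration, not part of this proposal's code: the class name MemoryPressureChecker is hypothetical, and the metric and option accessors used here (getGlobalMetrics(), ScalingMetric.GC_PRESSURE, ScalingMetric.HEAP_MAX_USAGE_RATIO, AutoScalerOptions.GC_PRESSURE_THRESHOLD, AutoScalerOptions.HEAP_USAGE_THRESHOLD) are assumptions based on the current flink-autoscaler module, so the exact names may differ in the final implementation.

{code:java}
import java.util.Map;

import org.apache.flink.autoscaler.JobAutoScalerContext;
import org.apache.flink.autoscaler.config.AutoScalerOptions;
import org.apache.flink.autoscaler.metrics.EvaluatedMetrics;
import org.apache.flink.autoscaler.metrics.EvaluatedScalingMetric;
import org.apache.flink.autoscaler.metrics.ScalingMetric;

/**
 * Illustrative sketch only: flags sustained GC pressure or heap usage above the configured
 * thresholds as an unrecoverable error, which would trigger a rollback. Assumes the proposed
 * JobUnrecoverableErrorChecker interface lives in the same package as this class.
 */
public class MemoryPressureChecker<KEY, Context extends JobAutoScalerContext<KEY>>
        implements JobUnrecoverableErrorChecker<KEY, Context> {

    @Override
    public boolean check(Context context, EvaluatedMetrics evaluatedMetrics) {
        // Assumed accessor: job-level metrics keyed by ScalingMetric.
        Map<ScalingMetric, EvaluatedScalingMetric> global = evaluatedMetrics.getGlobalMetrics();

        // Assumed option constants corresponding to job.autoscaler.memory.gc-pressure.threshold
        // and job.autoscaler.memory.heap-usage.threshold.
        double gcPressureThreshold =
                context.getConfiguration().get(AutoScalerOptions.GC_PRESSURE_THRESHOLD);
        double heapUsageThreshold =
                context.getConfiguration().get(AutoScalerOptions.HEAP_USAGE_THRESHOLD);

        return exceeds(global.get(ScalingMetric.GC_PRESSURE), gcPressureThreshold)
                || exceeds(global.get(ScalingMetric.HEAP_MAX_USAGE_RATIO), heapUsageThreshold);
    }

    private static boolean exceeds(EvaluatedScalingMetric metric, double threshold) {
        return metric != null && metric.getCurrent() > threshold;
    }
}
{code}

Under this sketch, the autoscaler would run all registered checkers after each scaling; as soon as any checker returns true, the scaling is rolled back and further scaling is paused until the cluster is restarted, as described in the issue.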