[ https://issues.apache.org/jira/browse/FLINK-37411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932494#comment-17932494 ]
Rui Fan commented on FLINK-37411:
---------------------------------

Thanks [~gyfora] for the quick feedback! :)

{quote}It feels like this rollback mechanism should be part of the Kubernetes operator and not the autoscaler otherwise we are going to be writing the same logic twice.{quote}

My initial thought is that the rollback mechanism would be more general as part of the autoscaler: it would be compatible with both the operator and the standalone autoscaler.

{quote}Furthermore if a job cannot be started the autoscaler won't even be called (as it's only called for stable jobs).{quote}

Do you mean this if condition? [https://github.com/apache/flink-kubernetes-operator/blob/9eb3c385b90a5a2f08376720f3204d1784981a0c/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobAutoScalerImpl.java#L97]

If so, could we handle some of these cases inside that if condition?

{quote}I see the need for the operator to communicate the rollback back to the autoscaler so it can prevent scaling to the same parallelism again but I think the autoscaler should not really be concerned with job failures/rollbacks itself.{quote}

Is it possible for the autoscaler to control the rollback-related logic while the operator (or the standalone autoscaler) stays unaware that the parallelism was rolled back? For example, the autoscaler could call ScalingRealizer#realizeParallelismOverrides with the last parallelism or the initial parallelism. (The operator would then adjust the job directly, without needing to care about why it happens.)

I'm PoC-ing it, please correct me if my solution doesn't work, thank you.

> Introduce the rollback mechanism for Autoscaler
> ------------------------------------------------
>
> Key: FLINK-37411
> URL: https://issues.apache.org/jira/browse/FLINK-37411
> Project: Flink
> Issue Type: New Feature
> Reporter: Rui Fan
> Assignee: Rui Fan
> Priority: Major
> Fix For: kubernetes-operator-1.12.0
>
>
> h1. Background & Motivation
> In some cases, a job becomes unhealthy (cannot run normally) after it is scaled by the autoscaler.
> One option is to roll the job back when it cannot run normally after scaling.
> h1. Examples (Which scenarios need a rollback mechanism?)
> h2. Example 1: The network memory is insufficient after scaling up.
> Flink tasks request more network memory after scaling up. A Flink job cannot be started (it fails over infinitely) if network memory is insufficient.
> The job may have had lag before scaling up, but it cannot run at all after scaling. We have 2 solutions for this case:
> * Autotuning is enabled: increase the TM network memory and restart the Flink cluster.
> * Autotuning is disabled (in-place rescaling): failing over (retrying) infinitely is useless; it's better to roll the job back to the last parallelisms or the first parallelisms.
> h2. Example 2: GC pressure or heap usage is high
> Currently, autoscaling is paused if the GC pressure exceeds its limit or the heap usage exceeds its threshold. (Check the job.autoscaler.memory.gc-pressure.threshold and job.autoscaler.memory.heap-usage.threshold options for more details.)
> This case might happen after scaling down; there are 2 solutions as well:
> * Autotuning is enabled: increase the TM heap memory. (The TM total memory may also need to be increased; currently, Autotuning never increases the TM total memory, it only decreases it.)
> * Autotuning is disabled (in-place rescaling): roll the job back to the last parallelisms or the first parallelisms.
> h1. Proposed change
> Note: Autotuning could be integrated with these examples in the future.
> This Jira introduces the JobUnrecoverableErrorChecker plugin (interface), and we could define 2 built-in checkers in the first version (for case 1 and case 2).
> {code:java}
> /**
>  * Check whether the job encountered an unrecoverable error.
>  *
>  * @param <KEY> The job key.
>  * @param <Context> Instance of JobAutoScalerContext.
>  */
> @Experimental
> public interface JobUnrecoverableErrorChecker<KEY, Context extends JobAutoScalerContext<KEY>> {
>
>     /**
>      * @return True means the job encountered an unrecoverable error, and the scaling will be
>      *     rolled back. Otherwise, the job ran normally or encountered a recoverable error.
>      */
>     boolean check(Context context, EvaluatedMetrics evaluatedMetrics);
> } {code}
> The job is rolled back when any checker rule returns true, and scaling is paused until the cluster is restarted.
> h2. What needs to be discussed:
> Should the job be rolled back to the parallelism initially set by the user, or to the last parallelism before scaling?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
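Editor's note: to make the proposed JobUnrecoverableErrorChecker plugin concrete, here is a minimal sketch of a built-in checker for Example 1 (insufficient network memory after scaling up). The JobAutoScalerContext and EvaluatedMetrics stand-ins below are deliberately simplified, and the lastFailureMessage field is a hypothetical way of surfacing the restart-loop exception; the real types in flink-autoscaler differ. Only the matched error message, "Insufficient number of network buffers", is a real Flink exception text.

```java
import java.util.regex.Pattern;

// Simplified stand-ins for the types discussed in this ticket. The real
// interfaces in flink-autoscaler carry many more members.
interface JobAutoScalerContext<KEY> {
    KEY getJobKey();
}

class EvaluatedMetrics {
    // Hypothetical field: the message of the last failure seen in the
    // restart loop, if any. How the real autoscaler would expose this is
    // part of the design discussion.
    final String lastFailureMessage;

    EvaluatedMetrics(String lastFailureMessage) {
        this.lastFailureMessage = lastFailureMessage;
    }
}

interface JobUnrecoverableErrorChecker<KEY, Context extends JobAutoScalerContext<KEY>> {
    boolean check(Context context, EvaluatedMetrics evaluatedMetrics);
}

/**
 * Sketch of a built-in checker for Example 1: the job cannot start because
 * network memory is insufficient after scaling up. It matches Flink's
 * well-known "Insufficient number of network buffers" error message.
 */
class InsufficientNetworkMemoryChecker<KEY, Context extends JobAutoScalerContext<KEY>>
        implements JobUnrecoverableErrorChecker<KEY, Context> {

    private static final Pattern NETWORK_BUFFER_ERROR =
            Pattern.compile("Insufficient number of network buffers");

    @Override
    public boolean check(Context context, EvaluatedMetrics metrics) {
        // Unrecoverable: retrying cannot succeed until memory or
        // parallelism changes, so a rollback is warranted.
        return metrics.lastFailureMessage != null
                && NETWORK_BUFFER_ERROR.matcher(metrics.lastFailureMessage).find();
    }
}

public class CheckerSketch {
    public static void main(String[] args) {
        JobAutoScalerContext<String> ctx = () -> "job-1";
        InsufficientNetworkMemoryChecker<String, JobAutoScalerContext<String>> checker =
                new InsufficientNetworkMemoryChecker<>();

        System.out.println(checker.check(ctx, new EvaluatedMetrics(
                "java.io.IOException: Insufficient number of network buffers: required 1024")));
        System.out.println(checker.check(ctx, new EvaluatedMetrics(null)));
    }
}
```

A message-pattern match is only one possible detection strategy; a real checker might instead inspect restart counts or typed exception metadata.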
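Editor's note: the comment above suggests realizing a rollback through ScalingRealizer#realizeParallelismOverrides, so the operator applies the old parallelisms without knowing why. The sketch below illustrates that idea under stated assumptions: the one-argument ScalingRealizer interface and the chooseRollbackTarget helper are hypothetical simplifications (the real realizer in flink-autoscaler also takes a context), and the choice between "last" and "initial" parallelisms is exactly the open question at the end of the ticket.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the ScalingRealizer mentioned in the comment;
// the real interface also receives the autoscaler context.
interface ScalingRealizer {
    void realizeParallelismOverrides(Map<String, String> parallelismOverrides);
}

public class RollbackSketch {

    /**
     * Picks the parallelism overrides to re-apply on rollback: either the
     * overrides from before the last scaling, or the user's initial ones.
     * Which of the two is correct is the open discussion point.
     */
    static Map<String, String> chooseRollbackTarget(
            Map<String, String> lastParallelisms,
            Map<String, String> initialParallelisms,
            boolean rollBackToInitial) {
        return rollBackToInitial ? initialParallelisms : lastParallelisms;
    }

    public static void main(String[] args) {
        Map<String, String> last = new HashMap<>();
        last.put("vertex-1", "4");
        Map<String, String> initial = new HashMap<>();
        initial.put("vertex-1", "2");

        // Record what the realizer was asked to apply, so we can inspect it.
        Map<String, String> applied = new HashMap<>();
        ScalingRealizer realizer = applied::putAll;

        // A checker reported an unrecoverable error: roll back to the
        // parallelisms from before the last scaling.
        realizer.realizeParallelismOverrides(chooseRollbackTarget(last, initial, false));
        System.out.println(applied.get("vertex-1"));
    }
}
```

The appeal of this shape is the one raised in the comment: the operator (or standalone autoscaler) just realizes whatever overrides it is handed, so no rollback logic has to be duplicated outside the autoscaler.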