[ https://issues.apache.org/jira/browse/FLINK-37411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932494#comment-17932494 ]
Rui Fan commented on FLINK-37411:
---------------------------------

Thanks [~gyfora] for the quick feedback! :)

{quote}It feels like this rollback mechanism should be part of the Kubernetes operator and not the autoscaler otherwise we are going to be writing the same logic twice.{quote}

My initial thought is that the rollback mechanism would be more general as part of the autoscaler: it would be compatible with both the operator and the standalone autoscaler.

{quote}Furthermore if a job cannot be started the autoscaler won't even be called (as it's only called for stable jobs).{quote}

Do you mean this if condition? [https://github.com/apache/flink-kubernetes-operator/blob/9eb3c385b90a5a2f08376720f3204d1784981a0c/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobAutoScalerImpl.java#L97]

If so, could we handle some of these cases inside that if condition?

{quote}I see the need for the operator to communicate the rollback back to the autoscaler so it can prevent scaling to the same parallelism again but I think the autoscaler should not really be concerned with job failures/rollbacks itself.{quote}

Is it possible for the autoscaler to control the rollback-related logic while the operator (or the standalone autoscaler) stays unaware that the parallelism was rolled back? For example, the autoscaler could call ScalingRealizer#realizeParallelismOverrides with the last parallelism or the initial parallelism. (The operator would then adjust the job directly, without needing to care about why it happens.)

I'm PoC-ing it, please correct me if my solution doesn't work, thank you.

> Introduce the rollback mechanism for Autoscaler
> ------------------------------------------------
>
> Key: FLINK-37411
> URL: https://issues.apache.org/jira/browse/FLINK-37411
> Project: Flink
> Issue Type: New Feature
> Reporter: Rui Fan
> Assignee: Rui Fan
> Priority: Major
> Fix For: kubernetes-operator-1.12.0
>
>
> h1. Background & Motivation
> In some cases, a job becomes unhealthy (cannot run normally) after it is scaled by the autoscaler.
> One option is to roll the job back when it cannot run normally after scaling.
> h1. Examples (Which scenarios need a rollback mechanism?)
> h2. Example 1: The network memory is insufficient after scaling up.
> Flink tasks request more network memory after scaling up. A Flink job cannot be started (it fails over infinitely) if network memory is insufficient.
> The job may have had lag before scaling up, but it cannot run at all after scaling. We have 2 solutions for this case:
> * Autotuning is enabled: increase the TM network memory and restart the Flink cluster.
> * Autotuning is disabled (in-place rescaling): failing over (retrying) infinitely is useless; it's better to roll the job back to the last parallelisms or the first parallelisms.
> h2. Example 2: GC pressure or heap usage is high
> Currently, autoscaling is paused if the GC pressure exceeds its limit or the heap usage exceeds its threshold. (Check the job.autoscaler.memory.gc-pressure.threshold and job.autoscaler.memory.heap-usage.threshold options for more details.)
> This case might happen after scaling down; there are 2 solutions as well:
> * Autotuning is enabled: increase the TM heap memory. (The TM total memory may also need to be increased; currently, Autotuning never increases the TM total memory, it only decreases it.)
> * Autotuning is disabled (in-place rescaling): roll the job back to the last parallelisms or the first parallelisms.
> h1. Proposed change
> Note: Autotuning could be integrated with these examples in the future.
> This Jira introduces the JobUnrecoverableErrorChecker plugin (interface), and we could define 2 built-in checkers in the first version (for case 1 and case 2).
> {code:java}
> /**
>  * Check whether the job encountered an unrecoverable error.
>  *
>  * @param <KEY> The job key.
>  * @param <Context> Instance of JobAutoScalerContext.
>  */
> @Experimental
> public interface JobUnrecoverableErrorChecker<KEY, Context extends JobAutoScalerContext<KEY>> {
>
>     /**
>      * @return True means the job encountered an unrecoverable error, and the scaling will be
>      *     rolled back. Otherwise, the job ran normally or encountered a recoverable error.
>      */
>     boolean check(Context context, EvaluatedMetrics evaluatedMetrics);
> } {code}
> The job is rolled back when any checker rule returns true, and scaling is paused until the cluster is restarted.
> h2. What needs to be discussed:
> Should the job be rolled back to the parallelism initially set by the user, or to the last parallelism before scaling?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
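Editor's note: to make the proposed JobUnrecoverableErrorChecker plugin concrete, here is a minimal sketch of a built-in checker for Example 1 (insufficient network memory after scaling up). The JobAutoScalerContext and EvaluatedMetrics stand-ins below are deliberately simplified, and the lastFailureMessage field is a hypothetical way of surfacing the restart-loop exception; the real types in flink-autoscaler differ. Only the matched error message, "Insufficient number of network buffers", is a real Flink exception text.

```java
import java.util.regex.Pattern;

// Simplified stand-ins for the types discussed in this ticket. The real
// interfaces in flink-autoscaler carry many more members.
interface JobAutoScalerContext<KEY> {
    KEY getJobKey();
}

class EvaluatedMetrics {
    // Hypothetical field: the message of the last failure seen in the
    // restart loop, if any. How the real autoscaler would expose this is
    // part of the design discussion.
    final String lastFailureMessage;

    EvaluatedMetrics(String lastFailureMessage) {
        this.lastFailureMessage = lastFailureMessage;
    }
}

interface JobUnrecoverableErrorChecker<KEY, Context extends JobAutoScalerContext<KEY>> {
    boolean check(Context context, EvaluatedMetrics evaluatedMetrics);
}

/**
 * Sketch of a built-in checker for Example 1: the job cannot start because
 * network memory is insufficient after scaling up. It matches Flink's
 * well-known "Insufficient number of network buffers" error message.
 */
class InsufficientNetworkMemoryChecker<KEY, Context extends JobAutoScalerContext<KEY>>
        implements JobUnrecoverableErrorChecker<KEY, Context> {

    private static final Pattern NETWORK_BUFFER_ERROR =
            Pattern.compile("Insufficient number of network buffers");

    @Override
    public boolean check(Context context, EvaluatedMetrics metrics) {
        // Unrecoverable: retrying cannot succeed until memory or
        // parallelism changes, so a rollback is warranted.
        return metrics.lastFailureMessage != null
                && NETWORK_BUFFER_ERROR.matcher(metrics.lastFailureMessage).find();
    }
}

public class CheckerSketch {
    public static void main(String[] args) {
        JobAutoScalerContext<String> ctx = () -> "job-1";
        InsufficientNetworkMemoryChecker<String, JobAutoScalerContext<String>> checker =
                new InsufficientNetworkMemoryChecker<>();

        System.out.println(checker.check(ctx, new EvaluatedMetrics(
                "java.io.IOException: Insufficient number of network buffers: required 1024")));
        System.out.println(checker.check(ctx, new EvaluatedMetrics(null)));
    }
}
```

A message-pattern match is only one possible detection strategy; a real checker might instead inspect restart counts or typed exception metadata.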
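Editor's note: the comment above suggests realizing a rollback through ScalingRealizer#realizeParallelismOverrides, so the operator applies the old parallelisms without knowing why. The sketch below illustrates that idea under stated assumptions: the one-argument ScalingRealizer interface and the chooseRollbackTarget helper are hypothetical simplifications (the real realizer in flink-autoscaler also takes a context), and the choice between "last" and "initial" parallelisms is exactly the open question at the end of the ticket.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the ScalingRealizer mentioned in the comment;
// the real interface also receives the autoscaler context.
interface ScalingRealizer {
    void realizeParallelismOverrides(Map<String, String> parallelismOverrides);
}

public class RollbackSketch {

    /**
     * Picks the parallelism overrides to re-apply on rollback: either the
     * overrides from before the last scaling, or the user's initial ones.
     * Which of the two is correct is the open discussion point.
     */
    static Map<String, String> chooseRollbackTarget(
            Map<String, String> lastParallelisms,
            Map<String, String> initialParallelisms,
            boolean rollBackToInitial) {
        return rollBackToInitial ? initialParallelisms : lastParallelisms;
    }

    public static void main(String[] args) {
        Map<String, String> last = new HashMap<>();
        last.put("vertex-1", "4");
        Map<String, String> initial = new HashMap<>();
        initial.put("vertex-1", "2");

        // Record what the realizer was asked to apply, so we can inspect it.
        Map<String, String> applied = new HashMap<>();
        ScalingRealizer realizer = applied::putAll;

        // A checker reported an unrecoverable error: roll back to the
        // parallelisms from before the last scaling.
        realizer.realizeParallelismOverrides(chooseRollbackTarget(last, initial, false));
        System.out.println(applied.get("vertex-1"));
    }
}
```

The appeal of this shape is the one raised in the comment: the operator (or standalone autoscaler) just realizes whatever overrides it is handed, so no rollback logic has to be duplicated outside the autoscaler.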