[ https://issues.apache.org/jira/browse/FLINK-37411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932585#comment-17932585 ]
Rui Fan commented on FLINK-37411:
---------------------------------

{quote}Let me give you an example. What happens if the user itself introduces a breaking change at any point in time? (independent of the autoscaler or parallelism settings){quote}

Thanks for the clarification! I understand your concerns now. Let me look into the operator-related logic first.

> Introduce the rollback mechanism for Autoscaler
> ------------------------------------------------
>
>                 Key: FLINK-37411
>                 URL: https://issues.apache.org/jira/browse/FLINK-37411
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>             Fix For: kubernetes-operator-1.12.0
>
>
> h1. Background & Motivation
> In some cases, a job becomes unhealthy (cannot run normally) after it is scaled by the autoscaler.
> One option is to roll the job back when it cannot run normally after scaling.
> h1. Examples (Which scenarios need a rollback mechanism?)
> h2. Example 1: The network memory is insufficient after scaling up
> Flink tasks request more network memory after scaling up. The Flink job cannot be started (it fails over infinitely) if the network memory is insufficient.
> The job may have had lag before scaling up, but it cannot run at all after scaling. We have two solutions for this case:
> * Autotuning is enabled: increase the TM network memory and restart the Flink cluster.
> * Autotuning is disabled (in-place rescaling): failing over (retrying) infinitely is useless; it's better to roll the job back to the last parallelisms or the first parallelisms.
> h2. Example 2: GC pressure or heap usage is high
> Currently, autoscaling is paused if the GC pressure or the heap usage exceeds the corresponding threshold. (See the job.autoscaler.memory.gc-pressure.threshold and job.autoscaler.memory.heap-usage.threshold options for more details.)
> This case might happen after scaling down. There are two solutions as well:
> * Autotuning is enabled: increase the TM heap memory. (The TM total memory may also need to be increased; currently, Autotuning never increases the TM total memory, it only decreases it.)
> * Autotuning is disabled (in-place rescaling): roll the job back to the last parallelisms or the first parallelisms.
> h1. Proposed change
> Note: Autotuning could be integrated with these examples in the future.
> This Jira introduces the JobUnrecoverableErrorChecker plugin interface, and we could define two built-in checkers in the first version (for Example 1 and Example 2); a sketch of one possible checker is shown after this issue description.
> {code:java}
> /**
>  * Check whether the job encountered an unrecoverable error.
>  *
>  * @param <KEY> The job key.
>  * @param <Context> Instance of JobAutoScalerContext.
>  */
> @Experimental
> public interface JobUnrecoverableErrorChecker<KEY, Context extends JobAutoScalerContext<KEY>> {
>
>     /**
>      * @return True means the job encountered an unrecoverable error and the scaling will be
>      *     rolled back. Otherwise, the job ran normally or encountered a recoverable error.
>      */
>     boolean check(Context context, EvaluatedMetrics evaluatedMetrics);
> } {code}
> The job is rolled back when any checker returns true, and scaling is paused until the cluster is restarted.
> h2. What needs to be discussed
> Should the job be rolled back to the parallelism initially set by the user, or to the last parallelism before scaling?
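
Below is a minimal, hedged sketch of how one of the proposed built-in checkers (the GC-pressure / heap-usage case from Example 2) could implement the interface above. It is only an illustration, not part of this proposal's code: the class name MemoryPressureChecker is hypothetical, and the metric and option accessors used here (getGlobalMetrics(), ScalingMetric.GC_PRESSURE, ScalingMetric.HEAP_MAX_USAGE_RATIO, AutoScalerOptions.GC_PRESSURE_THRESHOLD, AutoScalerOptions.HEAP_USAGE_THRESHOLD) are assumptions based on the current flink-autoscaler module, so the exact names may differ in the final implementation.

{code:java}
import java.util.Map;

import org.apache.flink.autoscaler.JobAutoScalerContext;
import org.apache.flink.autoscaler.config.AutoScalerOptions;
import org.apache.flink.autoscaler.metrics.EvaluatedMetrics;
import org.apache.flink.autoscaler.metrics.EvaluatedScalingMetric;
import org.apache.flink.autoscaler.metrics.ScalingMetric;

/**
 * Illustrative sketch only: flags sustained GC pressure or heap usage above the configured
 * thresholds as an unrecoverable error, which would trigger a rollback. Assumes the proposed
 * JobUnrecoverableErrorChecker interface lives in the same package as this class.
 */
public class MemoryPressureChecker<KEY, Context extends JobAutoScalerContext<KEY>>
        implements JobUnrecoverableErrorChecker<KEY, Context> {

    @Override
    public boolean check(Context context, EvaluatedMetrics evaluatedMetrics) {
        // Assumed accessor: job-level metrics keyed by ScalingMetric.
        Map<ScalingMetric, EvaluatedScalingMetric> global = evaluatedMetrics.getGlobalMetrics();

        // Assumed option constants corresponding to job.autoscaler.memory.gc-pressure.threshold
        // and job.autoscaler.memory.heap-usage.threshold.
        double gcPressureThreshold =
                context.getConfiguration().get(AutoScalerOptions.GC_PRESSURE_THRESHOLD);
        double heapUsageThreshold =
                context.getConfiguration().get(AutoScalerOptions.HEAP_USAGE_THRESHOLD);

        return exceeds(global.get(ScalingMetric.GC_PRESSURE), gcPressureThreshold)
                || exceeds(global.get(ScalingMetric.HEAP_MAX_USAGE_RATIO), heapUsageThreshold);
    }

    private static boolean exceeds(EvaluatedScalingMetric metric, double threshold) {
        return metric != null && metric.getCurrent() > threshold;
    }
}
{code}

Under this sketch, the autoscaler would run all registered checkers after each scaling; as soon as any checker returns true, the scaling is rolled back and further scaling is paused until the cluster is restarted, as described in the issue.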