Rui Fan created FLINK-37411:
-------------------------------

             Summary: Introduce the rollback mechanism for Autoscaler 
                 Key: FLINK-37411
                 URL: https://issues.apache.org/jira/browse/FLINK-37411
             Project: Flink
          Issue Type: New Feature
            Reporter: Rui Fan
            Assignee: Rui Fan
             Fix For: kubernetes-operator-1.12.0


h1. Background & Motivation

In some cases, a job becomes unhealthy (cannot run normally) after it is scaled 
by the autoscaler.

One option is rolling back the job when it cannot run normally after scaling.
h1. Examples (Which scenarios need a rollback mechanism?)
h2. Example 1: Network memory is insufficient after scaling up

Flink tasks request more network memory after scaling up. The Flink job cannot 
start (it fails over infinitely) if the network memory is insufficient.

The job may have lag before scaling up, but it cannot run at all after scaling. 
We have 2 solutions for this case:
 * Autotuning is enabled: increase the TM network memory and restart the Flink 
cluster.
 * Autotuning is disabled (in-place rescaling): failing over (retrying) 
infinitely is useless; it's better to roll the job back to the last 
parallelisms or the initial parallelisms.

h2. Example 2: GC pressure or heap usage is high

Currently, autoscaling is paused if the GC pressure or the heap usage exceeds 
its threshold. (Check the job.autoscaler.memory.gc-pressure.threshold and 
job.autoscaler.memory.heap-usage.threshold options for more details.)
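
For reference, both thresholds are regular Flink configuration options; a 
minimal example (the values below are illustrative, not necessarily the 
defaults):
{code:yaml}
# Pause autoscaling when the GC pressure ratio exceeds this value.
job.autoscaler.memory.gc-pressure.threshold: 0.3
# Pause autoscaling when the heap usage ratio exceeds this value.
job.autoscaler.memory.heap-usage.threshold: 0.9
{code}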

This case might happen after scaling down; there are 2 solutions as well:
 * Autotuning is enabled: increase the TM heap memory. (The TM total memory may 
also need to be increased; currently Autotuning only decreases the TM total 
memory and never increases it.)
 * Autotuning is disabled (in-place rescaling): roll the job back to the last 
parallelisms or the initial parallelisms.

h1. Proposed change

Note: Autotuning could be integrated with these examples in the future.

This Jira introduces the JobUnrecoverableErrorChecker plugin (interface), and 
we could define 2 built-in checkers in the first version (for Example 1 and 
Example 2).
{code:java}
/**
 * Check whether the job encountered an unrecoverable error.
 *
 * @param <KEY> The job key.
 * @param <Context> Instance of JobAutoScalerContext.
 */
@Experimental
public interface JobUnrecoverableErrorChecker<KEY, Context extends JobAutoScalerContext<KEY>> {

    /**
     * @return True if the job encountered an unrecoverable error, in which case the scaling
     *     will be rolled back. Otherwise, the job ran normally or encountered a recoverable
     *     error.
     */
    boolean check(Context context, EvaluatedMetrics evaluatedMetrics);
}
{code}
The job is rolled back when any checker returns true, and scaling is paused 
until the cluster is restarted.
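
To make the plugin concrete, here is a minimal sketch of what a built-in 
checker for Example 2 could look like. It assumes the ScalingMetric.GC_PRESSURE 
global metric and the EvaluatedMetrics / EvaluatedScalingMetric accessors from 
the flink-autoscaler module; the class name and the hard-coded threshold are 
placeholders, not part of this proposal.
{code:java}
/** Sketch of a built-in checker: reports an unrecoverable error on high GC pressure. */
@Experimental
public class GcPressureUnrecoverableErrorChecker<KEY, Context extends JobAutoScalerContext<KEY>>
        implements JobUnrecoverableErrorChecker<KEY, Context> {

    // Placeholder threshold; a real implementation would read
    // job.autoscaler.memory.gc-pressure.threshold from the context's configuration.
    private static final double GC_PRESSURE_THRESHOLD = 0.3;

    @Override
    public boolean check(Context context, EvaluatedMetrics evaluatedMetrics) {
        EvaluatedScalingMetric gcPressure =
                evaluatedMetrics.getGlobalMetrics().get(ScalingMetric.GC_PRESSURE);
        // A missing metric is treated as healthy; only a GC pressure above the
        // threshold is reported as unrecoverable (and triggers the rollback).
        return gcPressure != null && gcPressure.getCurrent() > GC_PRESSURE_THRESHOLD;
    }
}
{code}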
h2. What needs to be discussed

Should the job be rolled back to the parallelism initially set by the user, or 
to the last parallelism before scaling?
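
If this ends up being user-configurable, one hypothetical shape for the choice 
(the names are placeholders, not part of this proposal):
{code:java}
/** Hypothetical rollback target; which variant to use is exactly the open question. */
public enum RollbackTarget {
    // Roll back to the parallelisms applied by the last scaling before the failure.
    LAST_SCALING,
    // Roll back to the parallelisms initially set by the user, before any autoscaling.
    USER_INITIAL
}
{code}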



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
