Ihor Mielientiev created FLINK-39165:
----------------------------------------

             Summary: Concurrent FlinkDeployment and FlinkSessionJob upgrades 
lead to savepoint failure and state loss 
                 Key: FLINK-39165
                 URL: https://issues.apache.org/jira/browse/FLINK-39165
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
            Reporter: Ihor Mielientiev


When a FlinkDeployment (session cluster) and its associated FlinkSessionJob(s) 
are both updated simultaneously, the operator reconciles them concurrently. 
This leads to conflicting operations and non-deterministic behavior.

Symptoms:
 * Updating a FlinkSessionJob triggers the operator to take a savepoint of the 
running job and then cancel it.
 * Simultaneously, updating the FlinkDeployment (e.g., image change) triggers 
the operator to restart the session cluster (delete/recreate JM and TM pods).
 * When these happen in parallel:
 ** The in-progress savepoint fails because the cluster is torn down underneath 
it.
 ** The running job's state is lost (no successful savepoint was taken).
 ** After both upgrades complete, the JobManager is running but has no active 
jobs: the session job was neither gracefully stopped nor automatically 
resubmitted (Job Not Found error).

The outcome is non-deterministic: it depends entirely on the order in which the 
two controllers are scheduled, which the user cannot control.

Expected behavior: when a FlinkDeployment and its FlinkSessionJob(s) are 
updated concurrently, the operator should serialize the upgrades, so that the 
session job is stopped gracefully (with a savepoint) either before or after the 
cluster upgrade, never while it is in progress.
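
A minimal sketch of what serializing the two upgrades could look like, assuming 
both controllers run in the same operator process and can share a per-cluster 
lock. ClusterUpgradeSerializer and the "namespace/name" key format are 
hypothetical names used for illustration; they are not part of the operator's 
code base, and this is not a proposed patch:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not an operator class): serializes upgrade actions
 * that target the same session cluster.
 */
public class ClusterUpgradeSerializer {

    // One lock per session cluster, keyed by an assumed "namespace/name"
    // identifier of the FlinkDeployment.
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /**
     * Runs the action while holding the lock of the given cluster, so that
     * two actions submitted for the same key never overlap.
     */
    public <T> T runExclusive(String clusterKey, Supplier<T> action) {
        ReentrantLock lock =
                locks.computeIfAbsent(clusterKey, k -> new ReentrantLock());
        lock.lock();
        try {
            return action.get();
        } finally {
            // Always release, even if the action throws.
            lock.unlock();
        }
    }
}
```

Under these assumptions, the FlinkSessionJob controller would wrap its 
savepoint-and-cancel step in runExclusive(...) using the key of the target 
FlinkDeployment, and the FlinkDeployment controller would wrap the cluster 
restart under the same key, so whichever starts first completes before the 
other begins. A blocking lock ties up a reconciler thread while it waits; 
an actual fix might prefer to defer and requeue one of the reconciliations 
instead.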



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
