Ihor Mielientiev created FLINK-39165:
----------------------------------------
Summary: Concurrent FlinkDeployment and FlinkSessionJob upgrades
lead to savepoint failure and state loss
Key: FLINK-39165
URL: https://issues.apache.org/jira/browse/FLINK-39165
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Reporter: Ihor Mielientiev
When a FlinkDeployment (session cluster) and its associated FlinkSessionJob(s)
are both updated simultaneously, the operator reconciles them concurrently.
This leads to conflicting operations and non-deterministic behavior.
Symptoms:
* Updating a FlinkSessionJob triggers the operator to savepoint the running
job and cancel it.
* Simultaneously, updating the FlinkDeployment (e.g., image change) triggers the
operator to restart the session cluster (delete/recreate JM and TM pods).
* When these happen in parallel:
** The in-progress savepoint fails because the cluster is torn down underneath
it.
** The running job's state is lost (no successful savepoint was taken).
** After both upgrades complete, the JobManager is running but has no active
jobs: the session job was neither gracefully stopped nor automatically
resubmitted (Job Not Found error).
The result is non-deterministic: the outcome depends entirely on the scheduling
order of the two controllers, which is not user-controllable.
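To make the ordering dependence concrete, here is a minimal toy model in Python. It is not operator code: the step names, the simulate function, and the rule that a job is only resubmitted after a successful savepoint are all assumptions made for illustration. It only shows how the same set of steps produces different outcomes depending on where the cluster restart lands.

```python
def simulate(steps):
    """Replay one interleaving of upgrade steps against a toy session cluster."""
    job_running = True         # the session job starts out running
    savepoint_pending = False  # a savepoint has been triggered but not finished
    savepoint_done = False     # a savepoint completed successfully

    for step in steps:
        if step == "job:trigger_savepoint":
            savepoint_pending = job_running
        elif step == "cluster:restart":
            # JM/TM pods are deleted and recreated: an in-flight savepoint
            # fails and the session cluster comes back with no jobs.
            savepoint_pending = False
            job_running = False
        elif step == "job:savepoint_completes":
            savepoint_done = savepoint_done or savepoint_pending
            savepoint_pending = False
        elif step == "job:cancel":
            job_running = False
        elif step == "job:resubmit":
            # Toy assumption: resubmission only happens after a good savepoint.
            job_running = savepoint_done
        else:
            raise ValueError(f"unknown step: {step}")

    return {"job_running": job_running, "state_preserved": savepoint_done}


# Session job stops with a savepoint, then the cluster restarts, then the job resumes.
SERIALIZED = [
    "job:trigger_savepoint",
    "job:savepoint_completes",
    "job:cancel",
    "cluster:restart",
    "job:resubmit",
]

# The cluster restart lands while the savepoint is still in flight.
RACY = [
    "job:trigger_savepoint",
    "cluster:restart",
    "job:savepoint_completes",
    "job:cancel",
    "job:resubmit",
]

print(simulate(SERIALIZED))
print(simulate(RACY))
```

In this model the serialized ordering ends with the job running and its state preserved, while the racy ordering ends with no running job and no savepoint, which matches the symptoms listed above.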
Expected behavior: when a FlinkDeployment and its FlinkSessionJob(s) are
updated concurrently, the operator should serialize the upgrades, so that the
session job is stopped gracefully (with a savepoint) either before or after the
cluster upgrade, never during it.
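One way to picture the requested serialization is a lock keyed by session cluster name that both upgrade paths must hold. The sketch below is a hypothetical Python illustration, not how the operator is implemented: ClusterUpgradeSerializer, run_upgrade, and the step names are invented here, and a real fix in the operator might instead check the other resource's reconciliation status rather than use an in-process lock.

```python
import threading
from collections import defaultdict


class ClusterUpgradeSerializer:
    """Hands out one lock per session cluster name (hypothetical helper)."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def lock_for(self, cluster_name):
        # Guard the dict so two threads never create two locks for one cluster.
        with self._guard:
            return self._locks[cluster_name]


def run_upgrade(serializer, cluster_name, owner, steps, log):
    """Run all steps of one upgrade while holding the cluster's lock."""
    with serializer.lock_for(cluster_name):
        for step in steps:
            log.append((owner, step))


def demo():
    serializer = ClusterUpgradeSerializer()
    log = []
    job_upgrade = threading.Thread(
        target=run_upgrade,
        args=(serializer, "my-session-cluster", "session-job",
              ["savepoint", "cancel"], log),
    )
    cluster_upgrade = threading.Thread(
        target=run_upgrade,
        args=(serializer, "my-session-cluster", "deployment",
              ["delete-pods", "recreate-pods"], log),
    )
    job_upgrade.start()
    cluster_upgrade.start()
    job_upgrade.join()
    cluster_upgrade.join()
    return log


if __name__ == "__main__":
    for owner, step in demo():
        print(owner, step)
```

Which upgrade runs first is still up to the scheduler, but the steps of the two upgrades can no longer interleave, so the savepoint either completes before the pods are deleted or is only triggered after the cluster is back.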
--
This message was sent by Atlassian Jira
(v8.20.10#820010)