[ https://issues.apache.org/jira/browse/FLINK-39165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ihor Mielientiev updated FLINK-39165:
-------------------------------------
Description:
When a FlinkDeployment (session cluster) and its associated FlinkSessionJob(s)
are both updated simultaneously, the operator reconciles them concurrently.
This leads to conflicting operations and non-deterministic behavior.
Symptoms:
* Updating a FlinkSessionJob triggers the operator to savepoint the running
job and cancel it.
Simultaneously, updating the FlinkDeployment (e.g., image change) triggers the
operator to restart the session cluster (delete/recreate JM and TM pods).
* When these happen in parallel:
** The in-progress savepoint fails because the cluster is torn down underneath
it.
** The running job's state is lost (no successful savepoint was taken).
** After both upgrades complete, the JobManager is running but has no active
jobs: the session job was neither gracefully stopped nor automatically
resubmitted (Job Not Found error).
The result is non-deterministic: the outcome depends entirely on the scheduling
order of the two controllers, which is not user-controllable.
Expected behavior: when a FlinkDeployment and its FlinkSessionJob(s) are
updated concurrently, the operator should serialize the upgrades, ensuring that
the session job stops gracefully (with a savepoint) before or after the cluster
upgrade, not simultaneously.
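To illustrate what "serialize the upgrades" could mean, here is a minimal sketch
that guards each session cluster with its own lock, so a session job upgrade and
a cluster upgrade for the same cluster never interleave. This is not the
operator's actual code: ClusterUpgradeSerializer, runExclusively, simulate and
the cluster name are all hypothetical, and an in-process lock only helps when
both reconcilers run in the same operator instance.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical helper (not part of the operator): one lock per session cluster. */
public class ClusterUpgradeSerializer {
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Runs the step while holding the lock of the given session cluster. */
    public void runExclusively(String clusterName, Runnable upgradeStep) {
        ReentrantLock lock = locks.computeIfAbsent(clusterName, name -> new ReentrantLock());
        lock.lock();
        try {
            upgradeStep.run();
        } finally {
            lock.unlock();
        }
    }

    /** Stands in for slow work such as taking a savepoint or restarting pods. */
    private static void pause() {
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Starts both upgrades at once and returns the order of start/end events. */
    public static List<String> simulate() {
        ClusterUpgradeSerializer serializer = new ClusterUpgradeSerializer();
        List<String> events = new CopyOnWriteArrayList<>();
        String cluster = "session-cluster"; // assumed name, for illustration only

        Thread jobUpgrade = new Thread(() -> serializer.runExclusively(cluster, () -> {
            events.add("start:session-job");   // savepoint + cancel would happen here
            pause();
            events.add("end:session-job");
        }));
        Thread clusterUpgrade = new Thread(() -> serializer.runExclusively(cluster, () -> {
            events.add("start:deployment");    // JM/TM pod restart would happen here
            pause();
            events.add("end:deployment");
        }));

        jobUpgrade.start();
        clusterUpgrade.start();
        try {
            jobUpgrade.join();
            clusterUpgrade.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

Whichever upgrade acquires the lock first runs to completion before the other
starts, so the savepoint is never interrupted by a cluster restart. The lock
does not decide which upgrade goes first; enforcing a specific order (for
example, job savepoint before cluster restart) would need additional logic.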
> Concurrent FlinkDeployment and FlinkSessionJob upgrades lead to savepoint
> failure and state loss
> -------------------------------------------------------------------------------------------------
>
> Key: FLINK-39165
> URL: https://issues.apache.org/jira/browse/FLINK-39165
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Ihor Mielientiev
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.20.10#820010)