[ 
https://issues.apache.org/jira/browse/FLINK-39165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ihor Mielientiev updated FLINK-39165:
-------------------------------------
    Description: 
When a FlinkDeployment (session cluster) and its associated FlinkSessionJob(s) 
are both updated simultaneously, the operator reconciles them concurrently. 
This leads to conflicting operations and non-deterministic behavior.

Symptoms:
 * Updating a FlinkSessionJob triggers the operator to savepoint the running 
job and cancel it.
 * Simultaneously, updating the FlinkDeployment (e.g., image change) triggers 
the operator to restart the session cluster (delete/recreate JM and TM pods).
 * When these happen in parallel:
 ** The in-progress savepoint fails because the cluster is torn down underneath 
it.
 ** The running job's state is lost (no successful savepoint was taken).
 ** After both upgrades complete, the JobManager is running but has no active 
jobs: the session job was neither gracefully stopped nor automatically 
resubmitted (Job Not Found error).

The result is non-deterministic: the outcome depends entirely on the scheduling 
order of the two controllers, which is not user-controllable.

Expected behavior: when a FlinkDeployment and its FlinkSessionJob(s) are 
updated concurrently, the operator should serialize the upgrades, ensuring the 
session job gracefully stops (with a savepoint) before or after the cluster 
upgrade, not simultaneously.
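
One possible way to picture the serialization asked for above is a lock per 
session cluster that both upgrade paths must hold before acting. The sketch 
below is illustrative only and is not the operator's actual code: the class 
SessionUpgradeSerializer, the method runExclusively, the cluster name 
"session-cluster", and the event strings are all hypothetical, and the two 
upgrades are simulated with short sleeps instead of real savepoint or pod 
operations.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch: one lock per session cluster serializes upgrades. */
public class SessionUpgradeSerializer {

    // One lock per session cluster name (keying by name is an assumption).
    private final ConcurrentHashMap<String, ReentrantLock> locks =
            new ConcurrentHashMap<>();

    /** Runs the upgrade while holding the lock of the given session cluster. */
    public void runExclusively(String clusterName, Runnable upgrade) {
        ReentrantLock lock =
                locks.computeIfAbsent(clusterName, k -> new ReentrantLock());
        lock.lock();
        try {
            upgrade.run();
        } finally {
            lock.unlock();
        }
    }

    /** Triggers both simulated upgrades at once and returns the event log. */
    public static List<String> simulate() throws InterruptedException {
        SessionUpgradeSerializer serializer = new SessionUpgradeSerializer();
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Simulated FlinkSessionJob upgrade: savepoint, then cancel.
        pool.submit(() -> serializer.runExclusively("session-cluster", () -> {
            events.add("job:savepoint-start");
            pause(50);
            events.add("job:savepoint-done");
        }));
        // Simulated FlinkDeployment upgrade: restart the session cluster.
        pool.submit(() -> serializer.runExclusively("session-cluster", () -> {
            events.add("cluster:restart-start");
            pause(50);
            events.add("cluster:restart-done");
        }));

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return events;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Which upgrade runs first still varies, but they never interleave.
        System.out.println(simulate());
    }
}
```

In this sketch, which upgrade goes first is still decided by thread scheduling, 
but the savepoint can no longer overlap with the cluster restart. A real fix in 
the operator would more likely reschedule one reconciliation than block a 
reconciler thread, and it would also need to resubmit the session job after the 
cluster comes back; neither is shown here.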


> Concurrent FlinkDeployment and FlinkSessionJob upgrades lead to savepoint 
> failure and state loss 
> -------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39165
>                 URL: https://issues.apache.org/jira/browse/FLINK-39165
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Ihor Mielientiev
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
