[jira] [Updated] (FLINK-39270) Fix savepoint/last-state upgrade state loss under slow JM

Gyula Fora (Jira) Thu, 19 Mar 2026 06:24:26 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-39270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Gyula Fora updated FLINK-39270:
-------------------------------
    Description: 
The current savepoint upgrade logic for terminal jobs assumes that the JM 
cleans up all HA state promptly when the job finishes. Under some 
circumstances, such as heavy CPU/memory pressure, this is not guaranteed, and 
the operator mistakes the leftover HA metadata for metadata that is usable for 
last-state upgrades.

By the time the job starts, that HA metadata may be gone, and the job can start 
with an empty state, leading to complete state loss. We have observed this for 
some jobs under CPU throttling.

  was:
The current savepoint upgrade logic for terminal jobs assumes that the JM 
cleans up all HA state promptly when the job finishes. Under some 
circumstances, such as heavy CPU/memory pressure, this is not guaranteed, and 
the operator mistakes the leftover HA metadata for metadata that is usable for 
last-state upgrades.

By the time the job starts, that HA metadata may be gone, and the job can start 
with an empty state, leading to complete state loss. We have observed this for 
some jobs under CPU throttling.

This change fixes the logic for these upgrades so that a savepoint upgrade can 
always proceed for terminal jobs once their terminal state has been observed 
correctly.


> Fix savepoint/last-state upgrade state loss under slow JM
> ---------------------------------------------------------
>
>                 Key: FLINK-39270
>                 URL: https://issues.apache.org/jira/browse/FLINK-39270
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Gyula Fora
>            Assignee: Gyula Fora
>            Priority: Major
>
> The current savepoint upgrade logic for terminal jobs assumes that the JM 
> cleans up all HA state promptly when the job finishes. Under some 
> circumstances, such as heavy CPU/memory pressure, this is not guaranteed, and 
> the operator mistakes the leftover HA metadata for metadata that is usable 
> for last-state upgrades.
> By the time the job starts, that HA metadata may be gone, and the job can 
> start with an empty state, leading to complete state loss. We have observed 
> this for some jobs under CPU throttling.
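
The race described above can be illustrated with a minimal sketch. It is not 
the operator's actual code or the actual fix: every class, enum, and method 
name below is hypothetical, and the guard shown is only one assumed way to 
avoid trusting HA metadata that a slow JM has not yet cleaned up.

```java
import java.util.Optional;

/**
 * Hypothetical sketch only. None of these names come from the Flink
 * Kubernetes Operator code base; they exist to illustrate the race.
 */
class UpgradeDecisionSketch {

    /** Simplified job states; everything except RUNNING is treated as terminal. */
    enum JobStatus { RUNNING, FINISHED, FAILED, CANCELED }

    /** Where the upgraded job would restore its state from. */
    enum RestoreSource { SAVEPOINT, LAST_STATE_HA_METADATA, WAIT_AND_REOBSERVE }

    static boolean isTerminal(JobStatus status) {
        return status != JobStatus.RUNNING;
    }

    /**
     * Naive decision: any HA metadata that is still present is trusted.
     * For a terminal job under a slow JM, that metadata may be deleted
     * before the new job starts, so the job could come up with empty state.
     */
    static RestoreSource naiveChoice(
            JobStatus status, boolean haMetadataPresent, Optional<String> savepointPath) {
        if (haMetadataPresent) {
            return RestoreSource.LAST_STATE_HA_METADATA;
        }
        return savepointPath.isPresent()
                ? RestoreSource.SAVEPOINT
                : RestoreSource.WAIT_AND_REOBSERVE;
    }

    /**
     * Guarded decision (an assumption, not the confirmed fix): once the job
     * has been observed as terminal, leftover HA metadata is ignored and the
     * upgrade proceeds from the savepoint if one is known.
     */
    static RestoreSource guardedChoice(
            JobStatus status, boolean haMetadataPresent, Optional<String> savepointPath) {
        if (isTerminal(status)) {
            return savepointPath.isPresent()
                    ? RestoreSource.SAVEPOINT
                    : RestoreSource.WAIT_AND_REOBSERVE;
        }
        // Job is still running: HA metadata is actively maintained by the JM.
        return naiveChoice(status, haMetadataPresent, savepointPath);
    }

    public static void main(String[] args) {
        // Hypothetical savepoint location, used only for the demonstration.
        Optional<String> savepoint = Optional.of("s3://bucket/savepoints/sp-1");
        // Terminal job whose HA metadata has not been cleaned up yet.
        System.out.println(naiveChoice(JobStatus.FINISHED, true, savepoint));
        System.out.println(guardedChoice(JobStatus.FINISHED, true, savepoint));
    }
}
```

In this sketch the two functions differ only for terminal jobs whose HA 
metadata is still present, which is exactly the window the issue describes.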



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
