Gyula Fora created FLINK-39270:
----------------------------------

             Summary: Fix savepoint/last-state upgrade state loss under slow JM
                 Key: FLINK-39270
                 URL: https://issues.apache.org/jira/browse/FLINK-39270
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
            Reporter: Gyula Fora
            Assignee: Gyula Fora


The current savepoint upgrade logic for terminal jobs assumes that the JM 
cleans up all HA state promptly when the job finishes. Under some 
circumstances, such as heavy CPU/memory pressure, this is not guaranteed, and 
the operator mistakes the leftover HA metadata for metadata that is available 
for last-state upgrades.

By the time the job starts, that HA metadata may be gone, and the job can 
start with empty state, leading to complete state loss. We have observed this 
for some jobs under CPU throttling.

This change fixes the logic for these upgrades so that a savepoint upgrade 
can always proceed for terminal jobs once their terminal state has been 
observed correctly.
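
To illustrate the intended decision, here is a minimal sketch. It is not the 
operator's actual code: the class UpgradeModeSketch, the Mode enum, and the 
choose() method are hypothetical names, and the two boolean inputs are a 
simplification of what the operator observes. The only point is that leftover 
HA metadata should not redirect an upgrade to last-state once the job has been 
observed as terminal.

```java
// Hypothetical sketch; names and structure are assumptions, not the
// Flink Kubernetes Operator's real API.
class UpgradeModeSketch {

    enum Mode { SAVEPOINT, LAST_STATE }

    /**
     * Picks the mode for an upgrade that was requested as a savepoint upgrade.
     *
     * @param observedTerminal  the job was observed in a terminal state
     * @param haMetadataPresent HA metadata is still visible to the operator
     */
    static Mode choose(boolean observedTerminal, boolean haMetadataPresent) {
        if (observedTerminal) {
            // A slow JM may not have cleaned up its HA state yet, so visible
            // HA metadata is not trusted here. Proceed with the savepoint.
            return Mode.SAVEPOINT;
        }
        // Assumed simplification: for a job that is not terminal, visible
        // HA metadata allows a last-state upgrade.
        return haMetadataPresent ? Mode.LAST_STATE : Mode.SAVEPOINT;
    }

    public static void main(String[] args) {
        // Terminal job with leftover HA metadata: the case from this ticket.
        System.out.println(choose(true, true));
        // Non-terminal job with HA metadata available.
        System.out.println(choose(false, true));
    }
}
```

In this sketch the terminal-state check comes first, so the result for a 
terminal job does not depend on how quickly the JM removes its HA state.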



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
