[PR] [FLINK-39270] Fix savepoint/last-state upgrade state loss under slow JM [flink-kubernetes-operator]

via GitHub Thu, 19 Mar 2026 06:28:25 -0700


gyfora opened a new pull request, #1073:
URL: https://github.com/apache/flink-kubernetes-operator/pull/1073


   ## Change log
   
   The current savepoint upgrade logic for terminal jobs assumes that the JM 
cleans up all HA state promptly when the job finishes. Under some circumstances, 
such as heavy CPU/memory pressure, this is not guaranteed, and the operator 
mistakes the leftover HA metadata for metadata that is available for last-state 
upgrades. 
   
   By the time the job starts, that HA metadata may be gone, and the job can 
start with empty state, leading to complete state loss. We have observed this 
for some jobs under CPU throttling.
   
   This change fixes the logic for these upgrades so that a savepoint upgrade 
can always proceed for terminal jobs once their terminal state has been 
observed correctly.
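
The intended decision can be sketched roughly as follows. This is a minimal illustration only, not the code in this PR: the class `UpgradeDecisionSketch`, the method `mayUseHaMetadataForLastState`, and both parameters are hypothetical names, and the handling of non-terminal jobs is simplified to a single check.

```java
// Hypothetical sketch; none of these names come from the operator code base.
final class UpgradeDecisionSketch {

    private UpgradeDecisionSketch() {}

    /**
     * Decides whether HA metadata that is currently visible may be relied on
     * for a last-state upgrade.
     *
     * @param jobTerminal whether the job was observed in a terminal state
     * @param haMetadataPresent whether HA metadata is currently visible
     */
    static boolean mayUseHaMetadataForLastState(boolean jobTerminal, boolean haMetadataPresent) {
        if (jobTerminal) {
            // A slow JM may still be cleaning up: metadata seen now can be
            // gone by the time the new job starts, so it is never trusted.
            return false;
        }
        // Assumption: for a non-terminal job, visible metadata is usable.
        return haMetadataPresent;
    }
}
```

Under these assumptions, a terminal job never takes the last-state path, regardless of whether leftover HA metadata is still visible, so the savepoint upgrade can proceed.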
   
   ## Verifying this change
   
   Unit tests added
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., are there any changes to the 
`CustomResourceDescriptors`: no
     - Core observer or reconciler logic that is regularly executed: yes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
