Gyula Fora created FLINK-39270:
----------------------------------
Summary: Fix savepoint/last-state upgrade state loss under slow JM
Key: FLINK-39270
URL: https://issues.apache.org/jira/browse/FLINK-39270
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Reporter: Gyula Fora
Assignee: Gyula Fora
The current savepoint upgrade logic for terminal jobs assumes that the JM
cleans up all HA state promptly when the job finishes. Under some circumstances,
such as heavy CPU/memory pressure, this is not guaranteed, and the operator
mistakes the leftover HA metadata for metadata that is available for last-state
upgrades.
By the time the job starts, that HA metadata may be gone, and the job can start
with empty state, leading to complete state loss. We have observed this for
some jobs under CPU throttling.
This change fixes the logic for these upgrades to make sure that a savepoint
upgrade can always proceed for terminal jobs once their terminal state has been
observed correctly.
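As a rough illustration of the intended rule (not the operator's actual code),
the check could look like the sketch below. The class and method names are
hypothetical and exist only for this example. The assumption is that HA
metadata found for a job already observed in a terminal state is treated as
leftover data that a slow JM may still delete, so it is never counted as
available for a last-state upgrade.

```java
// Hypothetical sketch: these names do not come from the Kubernetes Operator code base.
final class HaMetadataTrustSketch {

    private HaMetadataTrustSketch() {}

    /**
     * Decides whether visible HA metadata may be used for a last-state upgrade.
     *
     * @param jobObservedTerminal the operator has observed the job in a terminal state
     * @param haMetadataPresent   HA metadata is still visible for the job
     */
    static boolean usableForLastState(boolean jobObservedTerminal, boolean haMetadataPresent) {
        // For a terminal job, any remaining HA metadata is assumed to be pending
        // cleanup by the JM, so it must not be relied on at restart time.
        return haMetadataPresent && !jobObservedTerminal;
    }
}
```

With this rule, a terminal job with leftover HA metadata is not routed to a
last-state upgrade, which leaves the savepoint upgrade path free to proceed.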
--
This message was sent by Atlassian Jira
(v8.20.10#820010)