gyfora opened a new pull request, #1073:
URL: https://github.com/apache/flink-kubernetes-operator/pull/1073
## Change log
The current savepoint upgrade logic for terminal jobs assumes that the JM
cleans up all HA state promptly when the job finishes. Under some circumstances,
such as heavy CPU/memory pressure, this is not guaranteed, and the operator
mistakes the leftover HA metadata for metadata that is available for
last-state upgrades.
By the time the job starts, that HA metadata may be gone, and the job can
start with an empty state, leading to complete state loss. We have observed this
for some jobs under CPU throttling.
This change fixes the logic for these upgrades so that a savepoint upgrade
can always proceed for terminal jobs once their terminal state has been
observed correctly.
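
To illustrate the failure mode, here is a minimal sketch of the decision described above. It is not the operator's actual code: the class, enum, and method names (`UpgradeModeSketch`, `JobState`, `UpgradePath`, `beforeFix`, `afterFix`) are hypothetical, and the non-terminal branch is reduced to a placeholder. It only assumes what the description states: leftover HA metadata must not steer a terminal job into a last-state upgrade.

```java
// Hypothetical sketch: names and structure are illustrative assumptions,
// not the flink-kubernetes-operator API.
public class UpgradeModeSketch {

    enum JobState { RUNNING, FINISHED, FAILED, CANCELED }

    // UNDECIDED stands in for all logic outside the scope of this sketch.
    enum UpgradePath { SAVEPOINT, LAST_STATE, UNDECIDED }

    static boolean isTerminal(JobState state) {
        return state == JobState.FINISHED
                || state == JobState.FAILED
                || state == JobState.CANCELED;
    }

    // Problematic behaviour as described: HA metadata that the JM has not
    // cleaned up yet is taken as proof that last-state upgrade is possible.
    static UpgradePath beforeFix(JobState state, boolean haMetadataPresent) {
        if (haMetadataPresent) {
            return UpgradePath.LAST_STATE;
        }
        return isTerminal(state) ? UpgradePath.SAVEPOINT : UpgradePath.UNDECIDED;
    }

    // Intended behaviour: an observed terminal state wins, so the savepoint
    // path is taken regardless of whether stale HA metadata is still visible.
    static UpgradePath afterFix(JobState state, boolean haMetadataPresent) {
        if (isTerminal(state)) {
            return UpgradePath.SAVEPOINT;
        }
        return haMetadataPresent ? UpgradePath.LAST_STATE : UpgradePath.UNDECIDED;
    }

    public static void main(String[] args) {
        // Terminal job whose HA metadata has not been cleaned up yet.
        System.out.println("before fix: " + beforeFix(JobState.FINISHED, true));
        System.out.println("after fix:  " + afterFix(JobState.FINISHED, true));
    }
}
```

In this simplified model, the two methods differ only in which check comes first. Checking the terminal state before the HA metadata is what keeps the upgrade from relying on metadata that may disappear before the job restarts.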
## Verifying this change
Unit tests added
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., are there any changes to the `CustomResourceDescriptors`: no
- Core observer or reconciler logic that is regularly executed: yes