Gyula Fora created FLINK-27675:
----------------------------------

             Summary: Improve manual savepoint tracking
                 Key: FLINK-27675
                 URL: https://issues.apache.org/jira/browse/FLINK-27675
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
            Reporter: Gyula Fora
             Fix For: kubernetes-operator-1.0.0


There are two problems with the manual savepoint result observing logic that 
can prevent the reconciler from making progress with the deployment 
(recoveries, upgrades, etc.).
 # Whenever the jobmanager deployment is not in READY state or the job itself 
is not RUNNING, the trigger info must be reset and we should not try to query 
it anymore. Flink will not retry the savepoint if the job fails or is 
restarted anyway.
 # If fetching the savepoint status returns a definitive error (such as: 
There is no savepoint operation with triggerId=xxx for job ) we should simply 
reset the trigger. These errors will never go away on their own and will only 
cause the deployment to get stuck observing/waiting for a savepoint to 
complete.
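
The two rules above could be sketched roughly as follows. This is only an illustration of the intended decision, not the operator's actual code: the class, the enums, and the decide and isPermanent helpers are all hypothetical names, and matching on the error message text is an assumption made for the sketch.

```java
import java.util.Optional;

/** Hypothetical sketch of the proposed trigger-reset rules; not operator code. */
public class SavepointTriggerSketch {

    /** Assumed, simplified jobmanager deployment states. */
    public enum JmStatus { READY, DEPLOYING, MISSING, ERROR }

    /** Assumed, simplified job states. */
    public enum JobState { RUNNING, FAILING, RESTARTING, FINISHED }

    /** What the observer should do with the pending manual savepoint trigger. */
    public enum Action { KEEP_WAITING, RESET_TRIGGER }

    /** Message fragment taken from the example error quoted in this issue. */
    private static final String NO_SUCH_OPERATION =
            "There is no savepoint operation with triggerId=";

    /** Assumption: an error is permanent if it says the trigger is unknown. */
    public static boolean isPermanent(String errorMessage) {
        return errorMessage != null && errorMessage.contains(NO_SUCH_OPERATION);
    }

    public static Action decide(JmStatus jm, JobState job, Optional<String> fetchError) {
        // Rule 1: without a READY jobmanager and a RUNNING job the savepoint
        // will not be retried by Flink, so stop querying and reset the trigger.
        if (jm != JmStatus.READY || job != JobState.RUNNING) {
            return Action.RESET_TRIGGER;
        }
        // Rule 2: an error that cannot resolve on its own also resets the trigger.
        if (fetchError.isPresent() && isPermanent(fetchError.get())) {
            return Action.RESET_TRIGGER;
        }
        // Otherwise (no error, or a possibly transient one) keep observing.
        return Action.KEEP_WAITING;
    }
}
```

In this sketch any error that is not recognized as permanent keeps the observer waiting, so a transient REST failure does not discard an in-progress savepoint.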



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
