Ruibin Xing created FLINK-32520:
-----------------------------------
Summary: FlinkDeployment recovered states from an obsolete
savepoint
Key: FLINK-32520
URL: https://issues.apache.org/jira/browse/FLINK-32520
Project: Flink
Issue Type: New Feature
Components: Kubernetes Operator
Affects Versions: 1.13.1
Reporter: Ruibin Xing
Attachments: flink_kubernetes_operator_0615.csv
Kubernetes Operator version: 1.5.0
During an upgrade of one of our Flink jobs, the job recovered from a savepoint
created by a previous version of the job. The timeline of the job is as follows:
# I upgraded the job for the first time. The job created a savepoint and
successfully restored from it.
# The job was running fine and created several checkpoints.
# Later, I performed the second upgrade. Soon after submitting it, and before the
JobManager stopped, I realized I had made a mistake in the spec, so I quickly
submitted the third upgrade.
# After the job started, I found that it had recovered from the savepoint
created during the first upgrade.
It appears that an error occurred while the third upgrade was being submitted.
However, even after investigating the code, I'm still not sure why this would
cause Flink to restore from the obsolete savepoint. The relevant operator logs
are attached.
Although I haven't found the root cause, I came up with some possible fixes:
# Remove the {{lastSavepoint}} after a job has successfully restored from it.
# Add an option for savepoints, similar to
{{kubernetes.operator.job.upgrade.last-state.max.allowed.checkpoint.age}}. The
operator should refuse to recover from a savepoint whose age exceeds the
configured maximum.
# Add a flag to the status that records whether a savepoint completed. Set the
flag to false when the savepoint is triggered and to true when it completes
successfully. The operator should report an error if the flag for the last
savepoint is false. (A rough sketch of ideas 2 and 3 follows this list.)
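To make ideas 2 and 3 more concrete, here is a minimal, hypothetical Java sketch.
It is not actual operator code; the names {{SavepointRestoreGuard}} and
{{SavepointRecord}} are made up for illustration. The guard refuses to restore
from a savepoint that was never confirmed as completed, or whose age exceeds a
configured maximum.
{code:java}
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical sketch (not actual operator code): refuse to restore from a
 * savepoint that is older than a configured max age or that was never
 * confirmed as successfully completed.
 */
public class SavepointRestoreGuard {

    /** Minimal stand-in for the savepoint info kept in the deployment status. */
    public record SavepointRecord(
            String location, Instant triggeredAt, boolean confirmedCompleted) {}

    private final Duration maxAllowedAge;

    public SavepointRestoreGuard(Duration maxAllowedAge) {
        this.maxAllowedAge = maxAllowedAge;
    }

    /**
     * Returns true only if the savepoint was confirmed completed (idea 3)
     * and is not older than the configured max age (idea 2).
     */
    public boolean canRestoreFrom(SavepointRecord savepoint, Instant now) {
        if (savepoint == null) {
            return false;
        }
        if (!savepoint.confirmedCompleted()) {
            // Savepoint was triggered but never confirmed; restoring from it
            // could silently roll the job back to older state.
            return false;
        }
        Duration age = Duration.between(savepoint.triggeredAt(), now);
        return age.compareTo(maxAllowedAge) <= 0;
    }

    public static void main(String[] args) {
        SavepointRestoreGuard guard = new SavepointRestoreGuard(Duration.ofHours(1));
        SavepointRecord stale = new SavepointRecord(
                "s3://bucket/savepoints/sp-1",
                Instant.now().minus(Duration.ofDays(2)),
                true);
        // Prints false: the savepoint completed, but it is far older than the max age.
        System.out.println(guard.canRestoreFrom(stale, Instant.now()));
    }
}
{code}
In the real operator such a check would presumably run during reconciliation,
before the savepoint path stored in the status is passed to the new job
submission.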
--
This message was sent by Atlassian Jira
(v8.20.10#820010)