Hi,

While using the last-state upgrade mode with flink-kubernetes-operator 1.2.0 and Flink 1.14.3, we occasionally run into the error shown below.
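For context, our deployments are configured roughly along the following lines. This is only a minimal sketch of the relevant settings; the image, jar path, HA storage directory, and resource sizes are placeholders rather than our actual spec:

  apiVersion: flink.apache.org/v1beta1
  kind: FlinkDeployment
  metadata:
    name: flinktest
  spec:
    image: flink:1.14.3                    # placeholder image
    flinkVersion: v1_14
    serviceAccount: flink
    flinkConfiguration:
      # Kubernetes HA, so that last-state upgrades can restore from the HA metadata
      high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
      high-availability.storageDir: s3://some-bucket/flink-ha        # placeholder path
      state.checkpoints.dir: s3://some-bucket/flink-checkpoints      # placeholder path
    jobManager:
      resource:
        memory: "2048m"
        cpu: 1
    taskManager:
      resource:
        memory: "2048m"
        cpu: 1
    job:
      jarURI: local:///opt/flink/usrlib/flinktest.jar                # placeholder jar
      parallelism: 1
      upgradeMode: last-state                                        # the upgrade mode in question

The status and events reported on the FlinkDeployment are: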
Status:
  Cluster Info:
    Flink - Revision:  98997ea @ 2022-01-08T23:23:54+01:00
    Flink - Version:   1.14.3
  Error:  HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.
  Job Manager Deployment Status:  ERROR
  Job Status:
    Job Id:    e8dd04ea4b03f1817a4a4b9e5282f433
    Job Name:  flinktest
    Savepoint Info:
      Last Periodic Savepoint Timestamp:  0
      Savepoint History:
      Trigger Id:
      Trigger Timestamp:  0
      Trigger Type:       UNKNOWN
    Start Time:   1668660381400
    State:        RECONCILING
    Update Time:  1668994910151
  Reconciliation Status:
    Last Reconciled Spec:      ...
    Reconciliation Timestamp:  1668660371853
    State:                     DEPLOYED
  Task Manager:
    Label Selector:  component=taskmanager,app=flinktest
    Replicas:        1
Events:
  Type     Reason            Age                 From                  Message
  ----     ------            ----                ----                  -------
  Normal   JobStatusChanged  30m                 Job                   Job status changed from RUNNING to RESTARTING
  Normal   JobStatusChanged  29m                 Job                   Job status changed from RESTARTING to CREATED
  Normal   JobStatusChanged  28m                 Job                   Job status changed from CREATED to RESTARTING
  Warning  Missing           26m                 JobManagerDeployment  Missing JobManager deployment
  Warning  RestoreFailed     9s (x106 over 26m)  JobManagerDeployment  HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.
  Normal   Submit            9s (x106 over 26m)  JobManagerDeployment  Starting deployment

We're happy with the last-state mode most of the time, but we run into this error occasionally. We found that the problem is not easy to reproduce: we tried killing the JMs and TMs, and even shut down the nodes on which the JMs and TMs were running. We also checked that the file size is not zero.

Thanks,
Dongwon
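P.S. For reference, killing the JMs and TMs was done roughly along these lines (a minimal sketch, assuming the standard app/component labels that the operator puts on the pods; the node shutdowns were done at the infrastructure level rather than through kubectl):

  # delete the JobManager and TaskManager pods of the flinktest deployment
  kubectl delete pod -l app=flinktest,component=jobmanager
  kubectl delete pod -l app=flinktest,component=taskmanager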