Hi Gyula :-) Okay, we're gonna upgrade to 1.15 and see what happens.
Thanks a lot for the quick feedback and the detailed explanation!

Best,

Dongwon

On Tue, Nov 22, 2022 at 5:57 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hi Dongwon!
>
> This error mostly occurs when using Flink 1.14 and the Flink cluster goes
> into a terminal state. If a Flink job is FAILED/FINISHED (for example, it
> has exhausted the retry strategy), in Flink 1.14 the cluster shuts itself
> down and removes the HA metadata.
>
> In these cases the operator only sees that the cluster has completely
> disappeared and that there is no HA metadata, so it throws the error you
> mentioned. It does not know what happened and has no way to recover
> checkpoint information.
>
> This is fixed in Flink 1.15: even after terminal FAILED/FINISHED states,
> the JobManager does not shut down. This allows the operator to observe the
> terminal state and actually recover the job even if the HA metadata was
> removed.
>
> To summarize, this is mostly caused by Flink 1.14 behaviour that the
> operator cannot control. Upgrading to 1.15 makes things much more robust
> and should eliminate most of these cases.
>
> Cheers,
> Gyula
>
> On Tue, Nov 22, 2022 at 9:43 AM Dongwon Kim <eastcirc...@gmail.com> wrote:
>
>> Hi,
>>
>> While using the last-state upgrade mode on flink-k8s-operator-1.2.0 and
>> flink-1.14.3, we occasionally face the following error:
>>
>>> Status:
>>>   Cluster Info:
>>>     Flink - Revision:  98997ea @ 2022-01-08T23:23:54+01:00
>>>     Flink - Version:   1.14.3
>>>   Error:  HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.
>>>   Job Manager Deployment Status:  ERROR
>>>   Job Status:
>>>     Job Id:    e8dd04ea4b03f1817a4a4b9e5282f433
>>>     Job Name:  flinktest
>>>     Savepoint Info:
>>>       Last Periodic Savepoint Timestamp:  0
>>>       Savepoint History:
>>>       Trigger Id:
>>>       Trigger Timestamp:  0
>>>       Trigger Type:       UNKNOWN
>>>     Start Time:   1668660381400
>>>     State:        RECONCILING
>>>     Update Time:  1668994910151
>>>   Reconciliation Status:
>>>     Last Reconciled Spec:      ...
>>>     Reconciliation Timestamp:  1668660371853
>>>     State:                     DEPLOYED
>>>   Task Manager:
>>>     Label Selector:  component=taskmanager,app=flinktest
>>>     Replicas:        1
>>> Events:
>>>   Type     Reason            Age                 From                  Message
>>>   ----     ------            ----                ----                  -------
>>>   Normal   JobStatusChanged  30m                 Job                   Job status changed from RUNNING to RESTARTING
>>>   Normal   JobStatusChanged  29m                 Job                   Job status changed from RESTARTING to CREATED
>>>   Normal   JobStatusChanged  28m                 Job                   Job status changed from CREATED to RESTARTING
>>>   Warning  Missing           26m                 JobManagerDeployment  Missing JobManager deployment
>>>   Warning  RestoreFailed     9s (x106 over 26m)  JobManagerDeployment  HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.
>>>   Normal   Submit            9s (x106 over 26m)  JobManagerDeployment  Starting deployment
>>
>> We're happy with the last-state mode most of the time, but we run into
>> this error occasionally.
>>
>> We found that it's not easy to reproduce the problem; we tried to kill
>> the JMs and TMs and even shut down the nodes on which the JMs and TMs
>> were running.
>>
>> We also checked that the file size is not zero.
>>
>> Thanks,
>>
>> Dongwon
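
For anyone following along, here is a minimal sketch of what the upgraded FlinkDeployment could look like with the last-state upgrade mode and Kubernetes HA. The image tag, bucket paths, jar path, and resource values below are placeholders for illustration, not the actual flinktest setup:

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: flinktest
spec:
  image: flink:1.15.3                 # placeholder image tag
  flinkVersion: v1_15                 # upgraded from v1_14
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    # Kubernetes HA with a durable storageDir is required for last-state upgrades
    high-availability: kubernetes
    high-availability.storageDir: s3://my-bucket/flink-ha          # placeholder path
    state.checkpoints.dir: s3://my-bucket/flink-checkpoints        # placeholder path
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/usrlib/flinktest.jar                # placeholder jar
    parallelism: 2
    upgradeMode: last-state

The relevant parts are flinkVersion: v1_15 and upgradeMode: last-state; as Gyula explains above, with 1.15 the JobManager stays up after a terminal FAILED/FINISHED state, so the operator can observe it instead of falling back to the manual-restore error.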