Hi,

While using the last-state upgrade mode with flink-kubernetes-operator 1.2.0 and
Flink 1.14.3, we're occasionally facing the following error:

Status:
>   Cluster Info:
>     Flink - Revision:             98997ea @ 2022-01-08T23:23:54+01:00
>     Flink - Version:              1.14.3
>   Error:                          HA metadata not available to restore
> from last state. It is possible that the job has finished or terminally
> failed, or the configmaps have been deleted. Manual restore required.
>   Job Manager Deployment Status:  ERROR
>   Job Status:
>     Job Id:    e8dd04ea4b03f1817a4a4b9e5282f433
>     Job Name:  flinktest
>     Savepoint Info:
>       Last Periodic Savepoint Timestamp:  0
>       Savepoint History:
>       Trigger Id:
>       Trigger Timestamp:  0
>       Trigger Type:       UNKNOWN
>     Start Time:           1668660381400
>     State:                RECONCILING
>     Update Time:          1668994910151
>   Reconciliation Status:
>     Last Reconciled Spec:  ...
>     Reconciliation Timestamp:  1668660371853
>     State:                     DEPLOYED
>   Task Manager:
>     Label Selector:  component=taskmanager,app=flinktest
>     Replicas:        1
> Events:
>   Type     Reason            Age                 From                  Message
>   ----     ------            ----                ----                  -------
>   Normal   JobStatusChanged  30m                 Job                   Job status changed from RUNNING to RESTARTING
>   Normal   JobStatusChanged  29m                 Job                   Job status changed from RESTARTING to CREATED
>   Normal   JobStatusChanged  28m                 Job                   Job status changed from CREATED to RESTARTING
>   Warning  Missing           26m                 JobManagerDeployment  Missing JobManager deployment
>   Warning  RestoreFailed     9s (x106 over 26m)  JobManagerDeployment  HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.
>   Normal   Submit            9s (x106 over 26m)  JobManagerDeployment  Starting deployment


We're happy with the last-state mode most of the time, but we hit this error
occasionally.

We found that it's not easy to reproduce the problem; we tried killing JMs
and TMs, and even shut down the nodes on which the JMs and TMs were running.
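For reference, this is roughly what we did (a sketch; the jobmanager label is an
assumption, mirroring the `component=taskmanager` selector shown in the status
above):

```shell
# Kill the JobManager pod and let Kubernetes restart it
# (the jobmanager label is an assumption based on the taskmanager selector)
kubectl delete pod -l component=jobmanager,app=flinktest

# Kill a TaskManager pod as well
kubectl delete pod -l component=taskmanager,app=flinktest

# Then check whether the HA configmaps are still present afterwards
kubectl get configmaps -l app=flinktest
```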

We also checked that the file size is not zero.

Thanks,

Dongwon
