Hi Gyula :-)

Okay, we're gonna upgrade to 1.15 and see what happens.
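
If I read the operator docs correctly, on our side this should mostly be a
matter of bumping the version in the FlinkDeployment spec, roughly along
these lines (the image tag is just an example, not our exact config):

  apiVersion: flink.apache.org/v1beta1
  kind: FlinkDeployment
  metadata:
    name: flinktest
  spec:
    image: flink:1.15        # example tag; we'll use the actual 1.15 image we build
    flinkVersion: v1_15      # was v1_14
    ...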

Thanks a lot for the quick feedback and the detailed explanation!

Best,

Dongwon


On Tue, Nov 22, 2022 at 5:57 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hi Dongwon!
>
> This error mostly occurs with Flink 1.14, when the Flink cluster goes into
> a terminal state. If a job ends up FAILED/FINISHED (for example because it
> exhausted its restart strategy), in Flink 1.14 the cluster shuts itself
> down and removes the HA metadata.
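>
> For example, with a fixed-delay restart strategy along these lines, the
> job goes to FAILED once the attempts are used up, and in 1.14 that is the
> point where the cluster shuts down and cleans up its HA configmaps (keys
> from the Flink docs, values just illustrative):
>
>   restart-strategy: fixed-delay
>   restart-strategy.fixed-delay.attempts: 3
>   restart-strategy.fixed-delay.delay: 10 s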
>
> In these cases the operator only sees that the cluster has completely
> disappeared and that there is no HA metadata, so it throws the error you
> mentioned. It does not know what happened and has no way to recover the
> checkpoint information.
>
> This is fixed in Flink 1.15, where the jobmanager does not shut down even
> after terminal FAILED/FINISHED states. This allows the operator to observe
> the terminal state and recover the job even if the HA metadata was
> removed.
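>
> If you want to double-check this on the Flink side, in 1.15 the behaviour
> is controlled by options along these lines (please verify the exact keys
> in the 1.15 docs; the operator should set suitable values for you when HA
> is enabled):
>
>   execution.shutdown-on-application-finish: false
>   execution.submit-failed-job-on-application-error: true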
>
> To summarize, this is mostly caused by Flink 1.14 behaviour that the
> operator cannot control. Upgrading to 1.15 makes recovery much more robust
> and should eliminate most of these cases.
>
> Cheers,
> Gyula
>
> On Tue, Nov 22, 2022 at 9:43 AM Dongwon Kim <eastcirc...@gmail.com> wrote:
>
>> Hi,
>>
>> While using the last-state upgrade mode on flink-k8s-operator-1.2.0 and
>> flink-1.14.3, we occasionally run into the following error:
>>
>> Status:
>>>   Cluster Info:
>>>     Flink - Revision:             98997ea @ 2022-01-08T23:23:54+01:00
>>>     Flink - Version:              1.14.3
>>>   Error:                          HA metadata not available to restore
>>> from last state. It is possible that the job has finished or terminally
>>> failed, or the configmaps have been deleted. Manual restore required.
>>>   Job Manager Deployment Status:  ERROR
>>>   Job Status:
>>>     Job Id:    e8dd04ea4b03f1817a4a4b9e5282f433
>>>     Job Name:  flinktest
>>>     Savepoint Info:
>>>       Last Periodic Savepoint Timestamp:  0
>>>       Savepoint History:
>>>       Trigger Id:
>>>       Trigger Timestamp:  0
>>>       Trigger Type:       UNKNOWN
>>>     Start Time:           1668660381400
>>>     State:                RECONCILING
>>>     Update Time:          1668994910151
>>>   Reconciliation Status:
>>>     Last Reconciled Spec:  ...
>>>     Reconciliation Timestamp:  1668660371853
>>>     State:                     DEPLOYED
>>>   Task Manager:
>>>     Label Selector:  component=taskmanager,app=flinktest
>>>     Replicas:        1
>>> Events:
>>>   Type     Reason            Age                 From                  Message
>>>   ----     ------            ----                ----                  -------
>>>   Normal   JobStatusChanged  30m                 Job                   Job status changed from RUNNING to RESTARTING
>>>   Normal   JobStatusChanged  29m                 Job                   Job status changed from RESTARTING to CREATED
>>>   Normal   JobStatusChanged  28m                 Job                   Job status changed from CREATED to RESTARTING
>>>   Warning  Missing           26m                 JobManagerDeployment  Missing JobManager deployment
>>>   Warning  RestoreFailed     9s (x106 over 26m)  JobManagerDeployment  HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.
>>>   Normal   Submit            9s (x106 over 26m)  JobManagerDeployment  Starting deployment
>>
>>
>> We're happy with the last-state mode most of the time, but we run into
>> this error occasionally.
>>
>> We found that it's not easy to reproduce the problem; we tried killing
>> JMs and TMs and even shutting down the nodes on which they were running.
>>
>> We also checked that the file size is not zero.
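>>
>> For context, the relevant part of the deployment follows the documented
>> last-state setup, roughly like this (paths and values below are
>> simplified placeholders, not our exact config):
>>
>>   spec:
>>     flinkVersion: v1_14
>>     flinkConfiguration:
>>       high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
>>       high-availability.storageDir: s3://.../ha            # placeholder
>>       state.checkpoints.dir: s3://.../checkpoints          # placeholder
>>     job:
>>       upgradeMode: last-state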
>>
>> Thanks,
>>
>> Dongwon
>>
>>
>>
