Hi all!
After updating the operator to version 1.6.0, suspending and resuming Flink jobs
stopped working.
When the job is resumed, the operator deletes the high-availability (HA) metadata
before redeploying, so the restore from last state fails.
Suspend job:
2023-09-11 06:01:41,548 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job]
>>> Event | Info | SPECCHANGED | UPGRADE change(s) detected (Diff:
FlinkDeploymentSpec[job.state : running -> suspended]), starting reconciliation.
2023-09-11 06:01:41,548 o.a.f.k.o.r.d.AbstractJobReconciler [INFO
][rec-job/rec-job] Job is in running state, ready for upgrade with LAST_STATE
2023-09-11 06:01:41,558 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job]
>>> Event | Info | SUSPENDED | Suspending existing deployment.
2023-09-11 06:01:41,558 o.a.f.k.o.s.AbstractFlinkService [INFO
][rec-job/rec-job] Deleting cluster with Foreground propagation
2023-09-11 06:01:41,558 o.a.f.k.o.s.NativeFlinkService [INFO ][rec-job/rec-job]
Deleting JobManager deployment while preserving HA metadata.
2023-09-11 06:01:41,598 o.a.f.k.o.s.AbstractFlinkService [INFO
][rec-job/rec-job] Waiting for cluster shutdown...
2023-09-11 06:01:45,667 o.a.f.k.o.s.AbstractFlinkService [INFO
][rec-job/rec-job] Waiting for cluster shutdown... (5s)
2023-09-11 06:01:50,730 o.a.f.k.o.s.AbstractFlinkService [INFO
][rec-job/rec-job] Waiting for cluster shutdown... (10s)
2023-09-11 06:01:55,837 o.a.f.k.o.s.AbstractFlinkService [INFO
][rec-job/rec-job] Waiting for cluster shutdown... (15s)
2023-09-11 06:02:00,885 o.a.f.k.o.s.AbstractFlinkService [INFO
][rec-job/rec-job] Waiting for cluster shutdown... (20s)
2023-09-11 06:02:01,895 o.a.f.k.o.s.AbstractFlinkService [INFO
][rec-job/rec-job] Cluster shutdown completed.
2023-09-11 06:02:01,973 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job]
>>> Status | Info | SUSPENDED | The resource (job) has been suspended
2023-09-11 06:02:01,981 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO
][rec-job/rec-job] Resource fully reconciled, nothing to do...
Resume job:
2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO
][rec-job/rec-job] Observing JobManager deployment. Previous status: MISSING
2023-09-11 06:02:07,488 o.a.f.k.o.o.d.ApplicationObserver [INFO
][rec-job/rec-job] JobManager is being deployed
2023-09-11 06:02:07,563 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job]
>>> Status | Info | SUSPENDED | The resource (job) has been suspended
2023-09-11 06:02:07,576 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job]
>>> Event | Info | SPECCHANGED | UPGRADE change(s) detected (Diff:
FlinkDeploymentSpec[job.state : suspended -> running]), starting reconciliation.
2023-09-11 06:02:07,649 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job]
>>> Status | Info | UPGRADING | The resource is being upgraded
2023-09-11 06:02:07,649 o.a.f.k.o.r.d.ApplicationReconciler [INFO
][rec-job/rec-job] Deleting deployment with terminated application before new
deployment
2023-09-11 06:02:07,649 o.a.f.k.o.s.AbstractFlinkService [INFO
][rec-job/rec-job] Deleting cluster with Foreground propagation
2023-09-11 06:02:07,649 o.a.f.k.o.s.NativeFlinkService [INFO ][rec-job/rec-job]
Deleting JobManager deployment and HA metadata.
2023-09-11 06:02:07,691 o.a.f.k.o.s.AbstractFlinkService [INFO
][rec-job/rec-job] Waiting for cluster shutdown...
2023-09-11 06:02:07,763 o.a.f.k.o.s.AbstractFlinkService [INFO
][rec-job/rec-job] Cluster shutdown completed.
2023-09-11 06:02:07,763 o.a.f.k.o.s.AbstractFlinkService [INFO
][rec-job/rec-job] Deleting Kubernetes HA metadata
2023-09-11 06:02:07,820 o.a.f.k.o.s.AbstractFlinkService [INFO
][rec-job/rec-job] Waiting for cluster shutdown...
2023-09-11 06:02:07,831 o.a.f.k.o.s.AbstractFlinkService [INFO
][rec-job/rec-job] Cluster shutdown completed.
2023-09-11 06:02:07,975 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job]
>>> Status | Info | UPGRADING | The resource is being upgraded
2023-09-11 06:02:07,987 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job]
>>> Event | Info | SUBMIT | Starting deployment
2023-09-11 06:02:07,987 o.a.f.k.o.s.AbstractFlinkService [INFO
][rec-job/rec-job] Deploying application cluster requiring last-state from HA
metadata
2023-09-11 06:02:07,999 o.a.f.k.o.c.FlinkDeploymentController
[ERROR][rec-job/rec-job] Flink recovery failed
2023-09-11 06:02:08,012 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job]
>>> Event | Warning | RESTOREFAILED | HA metadata not available to restore
from last state. It is possible that the job has finished or terminally failed,
or the configmaps have been deleted. Manual restore required.
2023-09-11 06:02:08,099 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job]
>>> Status | Error | UPGRADING |
{"type":"org.apache.flink.kubernetes.operator.exception.RecoveryFailureException","message":"HA
metadata not available to restore from last state. It is possible that the job
has finished or terminally failed, or the configmaps have been deleted. Manual
restore required.","additionalMetadata":{},"throwableList":[]}
2023-09-11 06:02:08,193 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job]
>>> Status | Info | UPGRADING | The resource is being upgraded
2023-09-11 06:02:08,218 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job]
>>> Event | Info | SUBMIT | Starting deployment
2023-09-11 06:02:08,218 o.a.f.k.o.s.AbstractFlinkService [INFO
][rec-job/rec-job] Deploying application cluster requiring last-state from HA
metadata
2023-09-11 06:02:08,228 o.a.f.k.o.c.FlinkDeploymentController
[ERROR][rec-job/rec-job] Flink recovery failed
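
For anyone trying to confirm the same sequence in their own operator logs, here is a minimal Python sketch. It is only a log-scanning helper, not part of the operator: the function names are made up for illustration, and it assumes the log is hard-wrapped like the excerpts above and that the two message strings appear exactly as they do in this log.

```python
import re

# A record starts with a timestamp such as "2023-09-11 06:02:07,649".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")


def join_records(log_text):
    """Merge hard-wrapped continuation lines into one record per timestamp."""
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if TIMESTAMP.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def ha_metadata_deleted_before_restore(log_text):
    """Return (delete_record, failure_record) if the HA metadata was deleted
    and a later restore from last state failed; otherwise return None."""
    deleted = None
    for record in join_records(log_text):
        if "Deleting JobManager deployment and HA metadata" in record:
            deleted = record
        elif deleted and "HA metadata not available to restore" in record:
            return deleted, record
    return None
```

Passing the full resume log to `ha_metadata_deleted_before_restore` should return the NativeFlinkService delete record together with the first record that reports the missing HA metadata; the suspend log alone should return None, because there the metadata is preserved.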