Maximilian Michels created FLINK-30266: ------------------------------------------
Summary: Recovery reconciliation loop fails if no checkpoint has been created yet Key: FLINK-30266 URL: https://issues.apache.org/jira/browse/FLINK-30266 Project: Flink Issue Type: Bug Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.3.0 Reporter: Maximilian Michels Assignee: Gyula Fora When the upgradeMode is LAST-STATE, the operator fails to reconcile a failed application unless at least one checkpoint has already been created. The expected behavior would be that the job starts with empty state. {noformat} 2022-12-01 10:58:35,596 o.a.f.k.o.l.AuditUtils [INFO ] [app] >>> Status | Error | UPGRADING | {"type":"org.apache.flink.kubernetes.operator.exception.DeploymentFailedException","message":"HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.","additionalMetadata":{"reason":"RestoreFailed"},"throwableList":[]} {noformat} {noformat} 2022-12-01 10:44:49,480 i.j.o.p.e.ReconciliationDispatcher [ERROR] [app] Error during event processing ExecutionScope{ resource id: ResourceID{name='app', namespace='namespace'}, version: 216933301} failed. org.apache.flink.kubernetes.operator.exception.ReconciliationException: java.lang.RuntimeException: This indicates a bug... at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:133) at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54) at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:136) at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:94) at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:80) at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:93) at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:130) at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:110) at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:81) at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:54) at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:406) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.lang.RuntimeException: This indicates a bug... at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:180) at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:61) at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.restoreJob(AbstractJobReconciler.java:212) at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.reconcileSpecChange(AbstractJobReconciler.java:144) at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:167) at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:64) at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:123) ... 13 more {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)