[ https://issues.apache.org/jira/browse/FLINK-30266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17642133#comment-17642133 ]
Thomas Weise commented on FLINK-30266:
--------------------------------------

I believe this was discussed before and the reason we decided to not allow this was that we cannot safely determine the reason why the HA metadata is missing. It could be because there was never any successful checkpoint or because it was removed by mistake? As long as we can ensure that we don't accidentally reset a job with prior state to empty state I would also prefer the solution that does not involve manual intervention.

> Recovery reconciliation loop fails if no checkpoint has been created yet
> ------------------------------------------------------------------------
>
>                 Key: FLINK-30266
>                 URL: https://issues.apache.org/jira/browse/FLINK-30266
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.3.0
>            Reporter: Maximilian Michels
>            Assignee: Gyula Fora
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: kubernetes-operator-1.3.0
>
>
> When the upgradeMode is LAST-STATE, the operator fails to reconcile a failed application unless at least one checkpoint has already been created. The expected behavior would be that the job starts with empty state.
> {noformat}
> 2022-12-01 10:58:35,596 o.a.f.k.o.l.AuditUtils [INFO ] [app] >>> Status | Error | UPGRADING | {"type":"org.apache.flink.kubernetes.operator.exception.DeploymentFailedException","message":"HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.","additionalMetadata":{"reason":"RestoreFailed"},"throwableList":[]}
> {noformat}
> {noformat}
> 2022-12-01 10:44:49,480 i.j.o.p.e.ReconciliationDispatcher [ERROR] [app] Error during event processing ExecutionScope{ resource id: ResourceID{name='app', namespace='namespace'}, version: 216933301} failed.
> org.apache.flink.kubernetes.operator.exception.ReconciliationException: java.lang.RuntimeException: This indicates a bug...
>     at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:133)
>     at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
>     at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:136)
>     at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:94)
>     at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:80)
>     at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:93)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:130)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:110)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:81)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:54)
>     at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:406)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: java.lang.RuntimeException: This indicates a bug...
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:180)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:61)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.restoreJob(AbstractJobReconciler.java:212)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.reconcileSpecChange(AbstractJobReconciler.java:144)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:167)
>     at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:64)
>     at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:123)
>     ... 13 more
> {noformat}

--
This message was sent by Atlassian Jira (v8.20.10#820010)
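
For illustration only, the safeguard raised in the comment (fall back to empty state only when it is certain that no checkpoint ever existed) could be sketched roughly as follows. Every name here (`LastStateRecoverySketch`, `CheckpointHistory`, `Decision`, `decide`) is hypothetical and not taken from the operator code base. The sketch also assumes the operator has some reliable record of whether a checkpoint ever completed, which is exactly the open question in this thread.

```java
/**
 * Hypothetical sketch of the recovery decision discussed in FLINK-30266.
 * None of these types or methods exist in the Flink Kubernetes Operator.
 */
final class LastStateRecoverySketch {

    /** Assumed knowledge about earlier checkpoints; UNKNOWN when nothing was recorded. */
    enum CheckpointHistory { NONE_EVER, AT_LEAST_ONE, UNKNOWN }

    enum Decision { RESTORE_FROM_HA_METADATA, START_WITH_EMPTY_STATE, REQUIRE_MANUAL_RESTORE }

    static Decision decide(boolean haMetadataAvailable, CheckpointHistory history) {
        if (haMetadataAvailable) {
            // Normal last-state path: restore from the HA metadata.
            return Decision.RESTORE_FROM_HA_METADATA;
        }
        if (history == CheckpointHistory.NONE_EVER) {
            // No prior state can be lost, so starting empty is safe.
            return Decision.START_WITH_EMPTY_STATE;
        }
        // Metadata is missing but prior state may exist (or we cannot tell):
        // never reset silently, keep requiring manual intervention.
        return Decision.REQUIRE_MANUAL_RESTORE;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, CheckpointHistory.NONE_EVER));
        System.out.println(decide(false, CheckpointHistory.UNKNOWN));
    }
}
```

The `UNKNOWN` case deliberately maps to manual restore, which matches the concern that missing HA metadata alone does not reveal whether it was never written or was deleted by mistake.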