[ https://issues.apache.org/jira/browse/FLINK-30150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643305#comment-17643305 ]
Peter Vary commented on FLINK-30150:
------------------------------------

This is the exception in the logs:
{code:java}
2022-12-05T11:40:59.2665289Z 2022-12-05 11:40:26,746 o.a.f.k.o.o.d.SessionObserver [ERROR][default/session-cluster-1] REST service in session cluster is bad now
2022-12-05T11:40:59.2665851Z java.util.concurrent.TimeoutException
2022-12-05T11:40:59.2666258Z     at java.base/java.util.concurrent.CompletableFuture.timedGet(Unknown Source)
2022-12-05T11:40:59.2666841Z     at java.base/java.util.concurrent.CompletableFuture.get(Unknown Source)
2022-12-05T11:40:59.2667549Z     at org.apache.flink.kubernetes.operator.service.AbstractFlinkService.listJobs(AbstractFlinkService.java:231)
2022-12-05T11:40:59.2668462Z     at org.apache.flink.kubernetes.operator.observer.deployment.SessionObserver.observeFlinkCluster(SessionObserver.java:48)
2022-12-05T11:40:59.2669809Z     at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeInternal(AbstractFlinkDeploymentObserver.java:89)
2022-12-05T11:40:59.2671385Z     at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeInternal(AbstractFlinkDeploymentObserver.java:55)
2022-12-05T11:40:59.2672514Z     at org.apache.flink.kubernetes.operator.observer.AbstractFlinkResourceObserver.observe(AbstractFlinkResourceObserver.java:56)
2022-12-05T11:40:59.2673507Z     at org.apache.flink.kubernetes.operator.observer.AbstractFlinkResourceObserver.observe(AbstractFlinkResourceObserver.java:32)
2022-12-05T11:40:59.2674466Z     at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:113)
2022-12-05T11:40:59.2675692Z     at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
2022-12-05T11:40:59.2676509Z     at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:136)
2022-12-05T11:40:59.2677043Z     at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:94)
2022-12-05T11:40:59.2677741Z     at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:80)
2022-12-05T11:40:59.2678451Z     at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:93)
2022-12-05T11:40:59.2679180Z     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:130)
2022-12-05T11:40:59.2680055Z     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:110)
2022-12-05T11:40:59.2681621Z     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:81)
2022-12-05T11:40:59.2682478Z     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:54)
2022-12-05T11:40:59.2683241Z     at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:406)
2022-12-05T11:40:59.2683817Z     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
2022-12-05T11:40:59.2684294Z     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
2022-12-05T11:40:59.2684676Z     at java.base/java.lang.Thread.run(Unknown Source)
{code}

The log line shows 2022-12-05 11:40:26,746 as the timestamp. This happens when we manually kill the job to test the recovery:
{code:java}
2022-12-05T11:40:12.8330378Z Successfully verified that sessionjob/flink-example-statemachine.status.jobStatus.state is in RUNNING state.
2022-12-05T11:40:12.9711940Z Kill the session-cluster-1-7bc5b4d7cb-t5hgq
2022-12-05T11:40:13.3083721Z Waiting for log "Restoring job ffffffff9b85cb750000000000000001 from Checkpoint"...
2022-12-05T11:40:35.8208688Z Log "Restoring job ffffffff9b85cb750000000000000001 from Checkpoint" shows up.
{code}

I would say that this is expected: the error is logged at 11:40:26, after the kill (11:40:12) and before the restore message shows up (11:40:35).

> Evaluate operator error log whitelist entry: REST service in session cluster is bad now
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-30150
>                 URL: https://issues.apache.org/jira/browse/FLINK-30150
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Kubernetes Operator
>            Reporter: Gabor Somogyi
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
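
The timing argument in the comment can be checked mechanically. The following is only a minimal sketch, not part of the operator or its e2e tests: the class and method names ({{RecoveryWindowCheck}}, {{withinRecoveryWindow}}) are hypothetical, and it assumes the operator log timestamp (11:40:26,746) is in UTC like the CI timestamps.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of the Flink Kubernetes Operator code base.
public class RecoveryWindowCheck {

    /** Returns true if the error instant lies within [killed, restored], bounds included. */
    static boolean withinRecoveryWindow(Instant killed, Instant error, Instant restored) {
        return !error.isBefore(killed) && !error.isAfter(restored);
    }

    public static void main(String[] args) {
        // Timestamps copied from the logs quoted in the comment.
        Instant killed = Instant.parse("2022-12-05T11:40:12.9711940Z");   // "Kill the session-cluster-1-..."
        // Assumption: the operator log time 11:40:26,746 is UTC.
        Instant error = Instant.parse("2022-12-05T11:40:26.746Z");        // "REST service in session cluster is bad now"
        Instant restored = Instant.parse("2022-12-05T11:40:35.8208688Z"); // "Restoring job ... from Checkpoint" shows up

        System.out.println("Error inside recovery window: "
                + withinRecoveryWindow(killed, error, restored));
        System.out.println("Milliseconds between kill and error: "
                + Duration.between(killed, error).toMillis());
    }
}
```

If the error had been logged outside this window, the whitelist entry would deserve a closer look rather than being treated as expected.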