[ https://issues.apache.org/jira/browse/FLINK-30315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Vary updated FLINK-30315:
-------------------------------
Description:
When there is an image pull error, this is what we see in the operator log:
{code:java}
org.apache.flink.kubernetes.operator.exception.DeploymentFailedException: Back-off pulling image "flink:1.14"
	at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.checkContainerBackoff(AbstractFlinkDeploymentObserver.java:194)
	at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeJmDeployment(AbstractFlinkDeploymentObserver.java:150)
	at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeInternal(AbstractFlinkDeploymentObserver.java:84)
	at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeInternal(AbstractFlinkDeploymentObserver.java:55)
	at org.apache.flink.kubernetes.operator.observer.AbstractFlinkResourceObserver.observe(AbstractFlinkResourceObserver.java:56)
	at org.apache.flink.kubernetes.operator.observer.AbstractFlinkResourceObserver.observe(AbstractFlinkResourceObserver.java:32)
	at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:113)
	at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
	at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:136)
	at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:94)
	at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:80)
	at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:93)
	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:130)
	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:110)
	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:81)
	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:54)
	at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:406)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
{code}
This is the information we have on the Kubernetes side:
{code}
Normal   Scheduled  2m19s               default-scheduler  Successfully assigned default/quickstart-base-86787586cd-lb7j6 to minikube
Warning  Failed     20s                 kubelet            Failed to pull image "flink:1.14": rpc error: code = Unknown desc = context deadline exceeded
Warning  Failed     20s                 kubelet            Error: ErrImagePull
Normal   BackOff    19s                 kubelet            Back-off pulling image "flink:1.14"
Warning  Failed     19s                 kubelet            Error: ImagePullBackOff
Normal   Pulling    7s (x2 over 2m19s)  kubelet            Pulling image "flink:1.14"
{code}
It would be good to add the additional message (in this case {{Failed to pull image "flink:1.14": rpc error: code = Unknown desc = context deadline exceeded}}) to the message of the {{DeploymentFailedException}} for traceability.
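
A minimal sketch of the proposed enrichment, assuming the operator can see the pod's events as (type, reason, message) triples in chronological order, as in the kubelet output listed above. {{ImagePullMessages}}, {{KubeEvent}} and {{enrich}} are hypothetical names used only for illustration; they are not part of the Flink Kubernetes Operator or the fabric8 Kubernetes client, and a real change might take the detail from a different source. Requires Java 16+ (records).

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper, NOT an existing Flink Kubernetes Operator API.
final class ImagePullMessages {

    // Hypothetical stand-in for a Kubernetes event: type, reason, message.
    record KubeEvent(String type, String reason, String message) {}

    // Appends the underlying kubelet failure (e.g. "Failed to pull image ...")
    // to the back-off message. Assumes events are ordered oldest first, so the
    // last matching event wins. "Error: ..." events only restate the waiting
    // reason (ErrImagePull, ImagePullBackOff) and are skipped.
    static String enrich(String backoffMessage, List<KubeEvent> events) {
        Optional<String> cause = Optional.empty();
        for (KubeEvent e : events) {
            if ("Warning".equals(e.type())
                    && "Failed".equals(e.reason())
                    && e.message() != null
                    && !e.message().startsWith("Error: ")) {
                cause = Optional.of(e.message());
            }
        }
        return cause.map(c -> backoffMessage + " (cause: " + c + ")")
                .orElse(backoffMessage);
    }

    public static void main(String[] args) {
        // The events from the kubelet output above, oldest first.
        List<KubeEvent> events = List.of(
                new KubeEvent("Normal", "Scheduled",
                        "Successfully assigned default/quickstart-base-86787586cd-lb7j6 to minikube"),
                new KubeEvent("Warning", "Failed",
                        "Failed to pull image \"flink:1.14\": rpc error: code = Unknown desc = context deadline exceeded"),
                new KubeEvent("Warning", "Failed", "Error: ErrImagePull"),
                new KubeEvent("Normal", "BackOff", "Back-off pulling image \"flink:1.14\""),
                new KubeEvent("Warning", "Failed", "Error: ImagePullBackOff"),
                new KubeEvent("Normal", "Pulling", "Pulling image \"flink:1.14\""));
        System.out.println(enrich("Back-off pulling image \"flink:1.14\"", events));
    }
}
```

Run against those events, {{main}} prints the back-off message followed by {{(cause: Failed to pull image "flink:1.14": rpc error: code = Unknown desc = context deadline exceeded)}}. When no matching event exists, the original message is returned unchanged, so the exception never loses information compared to today.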
> Add more information about image pull failures to the operator log
> ------------------------------------------------------------------
>
>                 Key: FLINK-30315
>                 URL: https://issues.apache.org/jira/browse/FLINK-30315
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Peter Vary
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)