Sergio Sainz created FLINK-31974:
------------------------------------
Summary: JobManager crashes after KubernetesClientException
exception with FatalExitExceptionHandler
Key: FLINK-31974
URL: https://issues.apache.org/jira/browse/FLINK-31974
Project: Flink
Issue Type: Bug
Components: Deployment / Kubernetes
Affects Versions: 1.17.0
Reporter: Sergio Sainz
When the resource quota limit is reached, the JobManager throws:
org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
Failure executing: POST at:
https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
Forbidden!Configured service account doesn't have access. Service account may
have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
forbidden: exceeded quota: my-namespace-resource-quota, requested:
limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
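The numbers in the message explain the rejection: 12100m already used plus the 3 CPUs requested exceeds the limit of 13 CPUs. A minimal sketch of that arithmetic, assuming all values are converted to millicores (the class and method names are hypothetical, not part of Flink or Kubernetes):

```java
public class QuotaCheck {
    /** Returns true when used + requested would exceed the hard limit (all values in millicores). */
    static boolean exceedsQuota(long usedMilli, long requestedMilli, long limitMilli) {
        return usedMilli + requestedMilli > limitMilli;
    }

    public static void main(String[] args) {
        long used = 12_100;      // used: limits.cpu=12100m
        long requested = 3_000;  // requested: limits.cpu=3
        long limit = 13_000;     // limited: limits.cpu=13
        // prints "total=15100m, exceeds=true"
        System.out.println("total=" + (used + requested) + "m, exceeds="
                + exceedsQuota(used, requested, limit));
    }
}
```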
In 1.16.1, this exception is handled gracefully:
2023-04-28 22:07:24,631 WARN
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Failed requesting worker with resource spec WorkerResourceSpec {cpuCores=1.0,
taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes,
networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914
bytes), numSlots=4}, current pending count: 0
java.util.concurrent.CompletionException:
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST
at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
Forbidden!Configured service account doesn't have access. Service account may
have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is
forbidden: exceeded quota: my-namespace-resource-quota, requested:
limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown
Source) ~[?:?]
at java.util.concurrent.CompletableFuture.completeThrowable(Unknown
Source) ~[?:?]
at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source)
~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
~[?:?]
at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure
executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods.
Message: Forbidden!Configured service account doesn't have access. Service
account may have been revoked. pods
"my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota:
my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m,
limited: limits.cpu=13.
at
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
~[flink-dist-1.16.1.jar:1.16.1]
at
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
~[flink-dist-1.16.1.jar:1.16.1]
at
io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
~[flink-dist-1.16.1.jar:1.16.1]
at
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
~[flink-dist-1.16.1.jar:1.16.1]
at
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
~[flink-dist-1.16.1.jar:1.16.1]
at
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
~[flink-dist-1.16.1.jar:1.16.1]
at
io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
~[flink-dist-1.16.1.jar:1.16.1]
at
io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
~[flink-dist-1.16.1.jar:1.16.1]
at
io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
~[flink-dist-1.16.1.jar:1.16.1]
at
org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
~[flink-dist-1.16.1.jar:1.16.1]
... 4 more
But in Flink 1.17.0, the JobManager crashes:
2023-04-28 20:50:50,534 ERROR org.apache.flink.util.FatalExitExceptionHandler
[] - FATAL: Thread 'flink-akka.actor.default-dispatcher-15' produced
an uncaught exception. Stopping the process...
java.util.concurrent.CompletionException:
org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
Failure executing: POST at:
https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
Forbidden!Configured service account doesn't have access. Service account may
have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
forbidden: exceeded quota: my-namespace-resource-quota, requested:
limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown
Source) ~[?:?]
at java.util.concurrent.CompletableFuture.completeThrowable(Unknown
Source) ~[?:?]
at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source)
~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
~[?:?]
at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by:
org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
Failure executing: POST at:
https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
Forbidden!Configured service account doesn't have access. Service account may
have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
forbidden: exceeded quota: my-namespace-resource-quota, requested:
limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
at
org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
~[flink-dist-1.17.0.jar:1.17.0]
at
org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
~[flink-dist-1.17.0.jar:1.17.0]
at
org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
~[flink-dist-1.17.0.jar:1.17.0]
at
org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
~[flink-dist-1.17.0.jar:1.17.0]
at
org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
~[flink-dist-1.17.0.jar:1.17.0]
at
org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
~[flink-dist-1.17.0.jar:1.17.0]
at
org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
~[flink-dist-1.17.0.jar:1.17.0]
at
org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
~[flink-dist-1.17.0.jar:1.17.0]
at
org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
~[flink-dist-1.17.0.jar:1.17.0]
at
org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
~[flink-dist-1.17.0.jar:1.17.0]
... 4 more
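The difference between the two logs can be illustrated with plain CompletableFuture code. The following is only a sketch and not Flink's actual implementation (this report does not identify what changed in 1.17.0): requestWorker is a hypothetical stand-in for the pod creation call, and the two callers show a failure that is handled inside the callback versus one that is rethrown to whatever thread consumes the result.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

public class WorkerRequestSketch {
    // Hypothetical stand-in for the asynchronous pod creation; it always fails,
    // like the quota rejection in the logs above.
    static CompletableFuture<String> requestWorker() {
        return CompletableFuture.supplyAsync(() -> {
            throw new IllegalStateException("exceeded quota");
        });
    }

    // Graceful variant: the failure is consumed in handle(), counted, and
    // replaced by a fallback value, so join() does not throw.
    static String requestGracefully(AtomicInteger failures) {
        return requestWorker()
                .handle((pod, error) -> {
                    if (error != null) {
                        failures.incrementAndGet(); // e.g. log a warning and retry later
                        return "retry-later";
                    }
                    return pod;
                })
                .join();
    }

    // Unhandled variant: join() rethrows the failure as a CompletionException.
    // If nothing catches it, the thread's uncaught exception handler runs.
    static void requestAndRethrow() {
        requestWorker().join();
    }
}
```

In a process that installs a fatal uncaught exception handler, as the FatalExitExceptionHandler log line suggests, the second variant would stop the process, while the first only records the failure.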