Nicolas Fraison created FLINK-32334:
---------------------------------------
Summary: Operator failed to create taskmanager deployment because
it already exist
Key: FLINK-32334
URL: https://issues.apache.org/jira/browse/FLINK-32334
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.5.0
Reporter: Nicolas Fraison
During a job upgrade, the operator failed to start the new job because it
could not create the taskmanager deployment:
{code:java}
Jun 12 19:45:28.115 >>> Status | Error | UPGRADING |
{"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException:
Could not create Kubernetes cluster
\"flink-metering\".","throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"Could
not create Kubernetes cluster
\"flink-metering\"."},{"type":"org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure
executing: POST at:
https://10.129.144.1/apis/apps/v1/namespaces/metering/deployments. Message:
object is being deleted: deployments.apps \"flink-metering-taskmanager\"
already exists. Received status: Status(apiVersion=v1, code=409,
details=StatusDetails(causes=[], group=apps, kind=deployments,
name=flink-metering-taskmanager, retryAfterSeconds=null, uid=null,
additionalProperties={}), kind=Status, message=object is being deleted:
deployments.apps \"flink-metering-taskmanager\" already exists,
metadata=ListMeta(_continue=null, remainingItemCount=null,
resourceVersion=null, selfLink=null, additionalProperties={}),
reason=AlreadyExists, status=Failure, additionalProperties={})."}]} {code}
As the error log indicates, this happens because the taskmanager deployment
still exists while it is being deleted.
Looking at the [source
code|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L150]
we do rely on the FOREGROUND deletion policy by default.
Still, it seems that the REST API delete call only waits until the resource
has been modified and the {{deletionTimestamp}} has been added to the metadata:
[ensure delete returns only when the delete operation is fully finished -
Issue #3246 -
fabric8io/kubernetes-client|https://github.com/fabric8io/kubernetes-client/issues/3246#issuecomment-874019899]
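So we could hit this issue whenever the k8s cluster is slow to "really" delete
the deployment: a create request sent while the object is still terminating is
rejected with 409 AlreadyExists, as in the log above.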
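One possible mitigation, sketched below only as an illustration and not as the operator's actual code, would be to poll until the old deployment is really gone before creating the new one. The class {{DeletionWaiter}}, the method {{waitUntilGone}} and the {{stillExists}} check are hypothetical names; in the operator the check would presumably be backed by a Kubernetes client lookup of the taskmanager deployment, which is stubbed out here with a counter so the example stays self-contained.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of flink-kubernetes-operator.
public class DeletionWaiter {

    /**
     * Polls stillExists until it reports false or the timeout elapses.
     *
     * @return true if the resource disappeared in time, false on timeout
     */
    public static boolean waitUntilGone(
            BooleanSupplier stillExists, Duration timeout, Duration pollInterval)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (stillExists.getAsBoolean()) {
            // Give up once the deadline has passed; the caller decides how to react.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(pollInterval.toMillis());
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for "does the taskmanager deployment still exist?":
        // reports true for the first three checks, then false.
        AtomicInteger remainingChecks = new AtomicInteger(3);
        boolean gone = waitUntilGone(
                () -> remainingChecks.getAndDecrement() > 0,
                Duration.ofSeconds(5),
                Duration.ofMillis(10));
        System.out.println("deployment gone: " + gone);
    }
}
```

With a helper along these lines, the create call would only be issued once {{waitUntilGone}} returns true; a false result would surface as a timeout to be retried on the next reconciliation rather than as a 409 from the API server.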
--
This message was sent by Atlassian Jira
(v8.20.10#820010)