Kartikey Pant created FLINK-37766:
-------------------------------------
Summary: FlinkSessionJob deletion blocked by finalizer when Flink
job already terminal/missing due to HA desync
Key: FLINK-37766
URL: https://issues.apache.org/jira/browse/FLINK-37766
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: 1.20.1
Environment: Flink Kubernetes Operator Image:
apache/flink-kubernetes-operator:1.10.0
Flink Image: apache/flink:1.20.1
Kubernetes: minikube version: v1.35.0
Reporter: Kartikey Pant
We've encountered an issue where {{FlinkSessionJob}} custom resources become
stuck in a {{Terminating}} state when deleted via {{{}kubectl delete{}}}. This
occurs after a desynchronization between the Flink Kubernetes Operator and the
Flink JobManager, typically initiated by a JobManager restart where its High
Availability (HA) mechanism fails to recover the state of the pre-existing job.
The sequence of events leading to the problem is as follows:
# A Flink JobManager pod for an active session cluster restarts.
# Upon restart, the JobManager's HA recovery fails to load the state of
previously running jobs. JobManager logs indicate this with messages like:
{{{}Retrieved job ids [] from KubernetesStateHandleStore...{}}}.
# This creates a desynchronization:
** The Flink Operator (via the {{FlinkSessionJob}} CR status) still holds
information about the original Flink JobID and its last known state/savepoint.
It attempts to reconcile this job.
** The newly started Flink JobManager has no internal record of this specific
job instance from its HA recovery.
# The {{FlinkSessionJob}} CR status often remains {{RECONCILING}} as the
Operator tries to manage a job the current JobManager doesn't recognize from
its HA state.
# When {{kubectl delete FlinkSessionJob <job-name>}} is issued, the Operator's
finalizer ({{{}flinksessionjobs.flink.apache.org/finalizer{}}}) logic is
triggered.
# The Operator attempts to cancel the Flink job via the JobManager's REST API
using the JobID from the CR status.
# The Flink JobManager, which either doesn't know the job or has internally
marked it as {{FAILED}} due to the ongoing reconciliation attempts for a
desynchronized job, responds with an error to the cancellation request.
JobManager logs show: {{Job cancellation failed because the job has already
reached another terminal state (FAILED).}}
# The Flink Kubernetes Operator's REST client logic or the finalizer's error
handling does not gracefully process this specific "already FAILED" (or
potentially "not found") response. An exception occurs within the Operator
(visible in Operator logs, often involving {{RestClient.parseResponse}} or
{{{}CompletableFuture.completeExceptionally{}}}).
# Due to this unhandled exception in the finalizer logic, the Operator fails
to remove its finalizer from the {{FlinkSessionJob}} CR.
# Consequently, the {{FlinkSessionJob}} CR remains stuck in the
{{Terminating}} state indefinitely, as shown by the check below.
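On an affected CR the stuck state is directly visible: {{deletionTimestamp}} is set, but the operator's finalizer is still present. A quick check (the resource name {{my-session-job}} is a placeholder):
{code:bash}
# Placeholder name "my-session-job"; adjust namespace/name to your setup.
kubectl get flinksessionjob my-session-job -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'
kubectl get flinksessionjob my-session-job -o jsonpath='{.metadata.finalizers}{"\n"}'
# Expected while stuck: a non-empty deletionTimestamp and
# ["flinksessionjobs.flink.apache.org/finalizer"] still listed.
{code}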
The only workaround is to manually edit the {{FlinkSessionJob}} CR and remove
the finalizer, allowing Kubernetes to complete the deletion.
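For reference, the manual unblock can be done with a single patch instead of an interactive edit (the name {{my-session-job}} is a placeholder). Note that this bypasses the operator's cleanup, so any Flink-side job state is left untouched:
{code:bash}
# Placeholder name; removes all finalizers so Kubernetes can finish the delete.
kubectl patch flinksessionjob my-session-job --type=merge -p '{"metadata":{"finalizers":null}}'
{code}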
*Steps to Reproduce:*
# Deploy a Flink Session Cluster with HA enabled (e.g., Kubernetes HA).
# Submit a {{FlinkSessionJob}} to the cluster.
# Induce a JobManager restart in such a way that its HA metadata for the
running job is lost or not recoverable (e.g., by temporarily clearing the HA
storage such as the HA ConfigMaps before the JobManager fully recovers, or by
simulating a crash where HA data isn't written); a command sketch follows these
steps.
# The new JobManager should start without recovering the previous job.
# The {{FlinkSessionJob}} CR may show {{RECONCILING}} as the Operator tries to
manage the desynchronized job.
# Attempt to delete the {{FlinkSessionJob}} CR using {{{}kubectl delete{}}}.
# Observe the Operator logs for exceptions during finalization and the
{{FlinkSessionJob}} CR getting stuck in the {{Terminating}} state.
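A condensed command sketch for steps 3-7, assuming a session cluster named {{my-session-cluster}} created by the operator with Kubernetes HA, a session job named {{my-session-job}}, and the default operator deployment name; all names and label selectors here are assumptions and may need adjusting:
{code:bash}
# 1. Drop the HA metadata so a restarted JobManager cannot recover the job
#    (Flink labels its HA ConfigMaps with configmap-type=high-availability).
kubectl delete configmap -l app=my-session-cluster,configmap-type=high-availability

# 2. Restart the JobManager; the new one should log "Retrieved job ids []".
kubectl delete pod -l app=my-session-cluster,component=jobmanager

# 3. Once the FlinkSessionJob shows RECONCILING, try to delete it
#    (--wait=false returns immediately instead of blocking on the stuck delete).
kubectl delete flinksessionjob my-session-job --wait=false

# 4. The CR stays in Terminating; check the operator logs for the exception.
kubectl get flinksessionjob my-session-job
kubectl logs deployment/flink-kubernetes-operator | grep -i -e cancel -e finalizer
{code}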
--
This message was sent by Atlassian Jira
(v8.20.10#820010)