Bhupendra Yadav created FLINK-32631:
---------------------------------------
Summary: FlinkSessionJob stuck in Created/Reconciling state because of No Job found error in JobManager
Key: FLINK-32631
URL: https://issues.apache.org/jira/browse/FLINK-32631
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: 1.16.0
Environment: Local
Reporter: Bhupendra Yadav

{*}Background{*}: We use FlinkSessionJob to submit jobs to a session cluster.

{*}Bug{*}: We frequently encounter a problem where the job gets stuck in the CREATED/RECONCILING state. On checking the Flink operator logs, we see the error {_}Job could not be found{_}. Full trace [here|https://ideone.com/NuAyEK].

# When a Flink session job is submitted, the Flink operator submits the job to the Flink cluster.
# If the Flink JobManager (JM) restarts for some reason, the job may no longer exist in the JM.
# Upon reconciliation, the Flink operator queries the JM's REST API for the job using its jobID, but it receives a 404 error, indicating that the job was not found.
# The operator then logs the error, and the job remains stuck in that state indefinitely.
# Attempting to restart or suspend the job through the operator's own mechanisms also fails, because the operator keeps calling the REST API and receiving the same 404 error.

{*}Expected Behavior{*}: When the Flink operator reconciles a job and finds that it no longer exists in the Flink cluster, it should handle the situation gracefully. Instead of getting stuck and logging errors indefinitely, the operator should mark the job as failed or deleted, or set another appropriate status for it.
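
To illustrate the expected behavior, here is a minimal sketch, not the operator's actual code. It assumes the job lookup can be reduced to an HTTP status code, and it treats a 404 as a distinct outcome instead of an error to retry forever. All names (MissingJobSketch, Outcome, observe, restLookup) are hypothetical and are not part of the Flink Kubernetes Operator. The REST call is simulated by a plain function.

```java
import java.util.function.ToIntFunction;

public class MissingJobSketch {

    /** Hypothetical outcomes; the operator's real status model may differ. */
    enum Outcome { JOB_PRESENT, JOB_MISSING, RETRY_LATER }

    /**
     * Classifies the HTTP status returned by a job lookup.
     * restLookup is a hypothetical stand-in for the call to the JM's REST API;
     * it takes a job ID and returns the HTTP status code of the response.
     */
    static Outcome observe(String jobId, ToIntFunction<String> restLookup) {
        int httpStatus = restLookup.applyAsInt(jobId);
        if (httpStatus == 200) {
            return Outcome.JOB_PRESENT;
        }
        if (httpStatus == 404) {
            // The JM no longer knows the job (for example after a restart).
            // Report this as a state instead of throwing, so the caller can
            // mark the job as failed or deleted, or allow a resubmission.
            return Outcome.JOB_MISSING;
        }
        // Any other status is treated as transient in this sketch.
        return Outcome.RETRY_LATER;
    }

    public static void main(String[] args) {
        // Simulated JM that lost its jobs after a restart: every lookup is a 404.
        ToIntFunction<String> restartedJobManager = jobId -> 404;
        System.out.println(observe("hypothetical-job-id", restartedJobManager));
    }
}
```

With a distinct JOB_MISSING outcome, restart and suspend requests would have something to act on, instead of failing on the same 404 during every reconciliation.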