Bhupendra Yadav created FLINK-32631:
---------------------------------------

             Summary: FlinkSessionJob stuck in Created/Reconciling state 
because of No Job found error in JobManager
                 Key: FLINK-32631
                 URL: https://issues.apache.org/jira/browse/FLINK-32631
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: 1.16.0
         Environment: Local
            Reporter: Bhupendra Yadav


{*}Background{*}: We are using FlinkSessionJob for submitting jobs to a session 
cluster.

{*}Bug{*}: We frequently encounter a problem where the job gets stuck in 
CREATED/RECONCILING state. On checking the Flink operator logs we see the error 
{_}Job could not be found{_}. Full trace [here|https://ideone.com/NuAyEK].
 # When a Flink session job is submitted, the Flink operator submits the job to 
the Flink Cluster.
 # If the Flink job manager (JM) restarts for some reason, the job may no 
longer exist in the JM.
 # Upon reconciliation, the Flink operator queries the JM's REST API for the 
job using its jobID, but it receives a 404 error, indicating that the job is 
not found.
 # The operator then logs the error, and the job remains stuck in that state 
indefinitely.
 # Attempting to restart or suspend the job using the operator's provided 
mechanisms also fails because the operator keeps calling the REST API and 
receiving the same 404 error.

{*}Expected Behavior{*}: Ideally, when the Flink operator reconciles a job and 
finds that it no longer exists in the Flink Cluster, it should handle the 
situation gracefully. Instead of getting stuck and logging errors indefinitely, 
the operator should mark the job as failed or deleted, or set an appropriate 
status for it.
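
A minimal sketch of the expected behavior, for illustration only. It is not the operator's actual code: the names SessionJob, Status, JobNotFound, reconcile, and the MISSING status are all hypothetical, and fetch_state stands in for whatever call queries the JM's REST API for the job by its jobID. The point is that a "job not found" answer is recorded as a status on the resource instead of surfacing as an error that repeats on every reconciliation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class JobNotFound(Exception):
    """Hypothetical: raised when the JM REST API answers 404 for a job ID."""


class Status(Enum):
    RECONCILING = "RECONCILING"
    RUNNING = "RUNNING"
    MISSING = "MISSING"  # hypothetical status for a job the JM no longer knows


@dataclass
class SessionJob:
    job_id: str
    status: Status = Status.RECONCILING
    error: Optional[str] = None


def reconcile(job: SessionJob, fetch_state: Callable[[str], str]) -> SessionJob:
    """Observe the job once, treating 'not found' as a state rather than a failure."""
    try:
        state = fetch_state(job.job_id)
    except JobNotFound:
        # The JM lost the job (for example after a restart): record that
        # on the resource so later reconciliations can act on it.
        job.status = Status.MISSING
        job.error = f"Job {job.job_id} not found in JobManager"
        return job
    job.status = Status.RUNNING if state == "RUNNING" else Status.RECONCILING
    job.error = None
    return job


def lost_after_jm_restart(job_id: str) -> str:
    """Simulates a JM that has restarted and no longer has the job."""
    raise JobNotFound(job_id)


if __name__ == "__main__":
    job = reconcile(SessionJob("abc"), lost_after_jm_restart)
    print(job.status.value, "-", job.error)
```

With a status like this in place, a restart or suspend request would have something to act on (resubmit the job, or mark it deleted) without querying the JM for a jobID that no longer exists.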



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
