Rasmus Bilgram created FLINK-38106:
--------------------------------------

             Summary: Job gets indefinitely stuck with "Job Not Found" events
                 Key: FLINK-38106
                 URL: https://issues.apache.org/jira/browse/FLINK-38106
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: 1.12.1
            Reporter: Rasmus Bilgram


We are running Flink jobs with upgradeMode: last-state. When we upgrade a job 
and the job graph has changed, the job ends up in an undesirable state where 
we only see "Job Not Found" events, there is no HA metadata, and restoring is 
only possible from the latest savepoint.
From the logs, Flink does not allow changing the job graph when restoring from 
a checkpoint. Such an upgrade is only possible with upgradeMode: savepoint, 
and we have used that to reproduce the issue.
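
For context, the setting in question lives under spec.job of the 
FlinkDeployment resource. The fragment below is only a minimal sketch: the 
resource name and jar path are hypothetical placeholders, not our actual 
configuration, and all unrelated fields are omitted.

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job                      # hypothetical name
spec:
  job:
    jarURI: local:///opt/flink/usrlib/example-job.jar   # hypothetical path
    # Mode under which the job gets stuck after a job graph change.
    # Switching this to "savepoint" is the alternative mentioned above.
    upgradeMode: last-state
```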

Steps:
1. Upgrade a job with a job graph change using upgradeMode: last-state.
2. The job manager pod gets "Caused by: java.lang.IllegalStateException: There is 
no operator for the state [id]" and restarts.
3. When the job manager starts, /overview returns an empty list of jobs to the 
operator.
4. The operator sets the status to RECONCILING. Since it is not FAILED, no 
redeployments are attempted.
5. The operator starts producing "Job Not Found" events.
6. We observed that the HA metadata is also missing.
7. The job is stuck until we manually restore from a savepoint.

We are a little concerned that this can also be caused by other issues, maybe 
an OOM on the job manager. In that case a FAILED state is perhaps better, so 
that the retry mechanisms are triggered.
Alternatively, it would be great if a rollback from the latest checkpoint (with 
the former job graph) were possible. We tried the rollback mechanism, but it 
complained about missing HA metadata.

It seems similar to: https://issues.apache.org/jira/browse/FLINK-32631



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
