Rasmus Bilgram created FLINK-38106:
--------------------------------------

             Summary: Job gets indefinitely stuck with "Job Not Found" events
                 Key: FLINK-38106
                 URL: https://issues.apache.org/jira/browse/FLINK-38106
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: 1.12.1
            Reporter: Rasmus Bilgram

We are running Flink jobs with upgradeMode: last-state. When we upgrade a job to a different job graph, the job ends up in an undesirable state: we only see "Job Not Found" events, there is no HA metadata, and restoring is only possible from the latest savepoint.

From the logs, Flink does not allow changing the job graph when restoring from a checkpoint. Such an upgrade is only possible using upgradeMode: savepoint, and we have used that to reproduce the issue.

Steps:
1. Upgrade a job with a job graph change using last-state upgradeMode.
2. The JobManager pod gets "Caused by: java.lang.IllegalStateException: There is no operator for the state [id]" and restarts.
3. When the JobManager starts, /overview returns an empty list of jobs to the operator.
4. The operator sets the status to RECONCILING. Since it is not FAILED, no redeployments are attempted.
5. The operator starts producing "Job Not Found" events.
6. We observed that the HA metadata is also missing.
7. The job is stuck until we manually restore from a savepoint.

We are a little concerned that this could also be caused by other issues, maybe an OOM on the JobManager. In that case a FAILED state is perhaps better, so that the retry mechanisms are triggered. Alternatively, it would be great if rollback from the latest checkpoint (with the former job graph) were possible. We tried the rollback mechanism, but it complained about no HA metadata.

It seems similar to: https://issues.apache.org/jira/browse/FLINK-32631

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
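
For reference, the upgrade mode named in step 1 is set in the job section of the FlinkDeployment resource. The fragment below is only a minimal sketch of where that setting lives. The resource name and jar path are placeholders and are not taken from the report, and the rest of the spec (image, resources, HA configuration) is omitted.

```yaml
# Illustrative fragment only; example-job and the jar path are placeholders.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job                                      # placeholder name
spec:
  job:
    jarURI: local:///opt/flink/usrlib/example-job.jar    # placeholder path
    upgradeMode: last-state    # mode used in the report; savepoint is the alternative it mentions
    state: running
```

Per the report, a job graph change submitted with this mode is what leads to the restore failure in step 2.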