Paul Lin created FLINK-21251:
--------------------------------

             Summary: Last valid checkpoint metadata lost after job exits 
restart loop
                 Key: FLINK-21251
                 URL: https://issues.apache.org/jira/browse/FLINK-21251
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.7.2
            Reporter: Paul Lin
         Attachments: jm_logs

We have a Flink job running a relatively old version, 1.7.1, that failed with 
no valid checkpoint to restore from. The job was first hit by Kafka network 
instability and fell into a restart loop under a policy of 3 restarts in 5 
minutes. After the restarts were exhausted, the job reached the terminal state 
FAILED and exited. The problem is that the last valid checkpoint, 4585, which 
had been restored multiple times during the restarts, was corrupted (no 
_metadata) after the job exited.

 

I checked the checkpoint directory on HDFS and found that chk-4585, which was 
completed at 12:16, was modified at 12:23, while the jobmanager was shutting 
down and logging many errors saying that deleting pending checkpoints had 
failed. I therefore suspect that the checkpoint metadata was unexpectedly 
deleted by the jobmanager.

 

The jobmanager logs are attached (jm_logs).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
