Can you provide us with the JobManager logs?
After the first restart the JM should have started deleting older
checkpoints as new ones were created.
After the second restart the JM should have recovered all 10
checkpoints, started from the latest, and started pruning old ones as
new ones were created.
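To make the expected behavior concrete, here is a minimal sketch (my own illustration, not Flink's actual code) of what retention with state.checkpoints.num-retained: 5 should look like: each newly completed checkpoint subsumes the oldest once the retained count exceeds the limit, so only the 5 most recent survive.

```java
import java.util.ArrayDeque;

/**
 * Hypothetical model of checkpoint retention (not Flink's API):
 * keep at most numRetained completed checkpoints, pruning the
 * oldest as new ones complete.
 */
class RetainedCheckpoints {
    private final ArrayDeque<Long> retained = new ArrayDeque<>();
    private final int numRetained;

    RetainedCheckpoints(int numRetained) {
        this.numRetained = numRetained;
    }

    /** Register a newly completed checkpoint; prune the oldest if over the limit. */
    void add(long checkpointId) {
        retained.addLast(checkpointId);
        while (retained.size() > numRetained) {
            // In the real JM this is where the old checkpoint
            // directory should also be deleted from HDFS.
            retained.removeFirst();
        }
    }

    int size() { return retained.size(); }
    long latest() { return retained.getLast(); }
    long oldest() { return retained.getFirst(); }
}
```

After 10 completed checkpoints with num-retained = 5, only checkpoints 6 through 10 should remain, which is the pruning the JM apparently did not do after the first restart.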
So you're running into 2 separate issues here, which is a bit odd.
On 05/06/2019 13:44, Vishal Santoshi wrote:
Any one?
On Tue, Jun 4, 2019, 2:41 PM Vishal Santoshi
<vishal.santo...@gmail.com> wrote:
The above is Flink 1.8.
On Tue, Jun 4, 2019 at 12:32 PM Vishal Santoshi
<vishal.santo...@gmail.com> wrote:
I had a sequence of events that created this issue.
* I started a job with state.checkpoints.num-retained: 5.
* As expected, I have the 5 latest checkpoints retained in my HDFS
backend.
* The JM dies ( K8s limit etc. ) without cleaning the HDFS
directory. The K8s job restores from the latest checkpoint (
I think ), but as it creates new checkpoints it does not delete
the older checkpoints. In the end there are 10 checkpoints: 5
from the old run, which remain static, and the 5 latest representing
the ongoing pipeline.
* The JM dies again and restarts from the latest of the 5
old checkpoints.
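For reference, the retention setting mentioned above lives in flink-conf.yaml:

```yaml
# flink-conf.yaml
state.checkpoints.num-retained: 5
```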
This looks like a bug in the Job Cluster implementation of Flink.
It looks like it is taking the 5th checkpoint from the
beginning, based on the num-retained value. Note that it has the
same job id and does not scope to a new directory.
https://github.com/apache/flink/blob/1dfdaa417ab7cdca9bef1efe6381c7eb67022aaf/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L109
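To spell out what I would expect recovery to do: after the restart, with 10 checkpoint directories still on HDFS, the store should resume from the one with the highest checkpoint ID, not from the 5th counted from the beginning. A minimal sketch of that expectation (hypothetical helper, not Flink's actual code):

```java
import java.util.Collections;
import java.util.List;

/**
 * Sketch of the restore selection I would expect, assuming
 * recovery finds the IDs of all completed checkpoints still
 * present in the HA store / on HDFS.
 */
class RecoverySketch {
    static long selectRestoreCheckpoint(List<Long> recoveredIds) {
        if (recoveredIds.isEmpty()) {
            throw new IllegalStateException("no completed checkpoints recovered");
        }
        // Always the latest checkpoint, regardless of how many
        // stale ones are lying around from before the crash.
        return Collections.max(recoveredIds);
    }
}
```

With IDs 1 through 10 recovered, this picks 10; the behavior I observed looks as if checkpoint 5 was chosen instead.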
Please tell me if this does not make sense.
Vishal