Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

Vishal Santoshi Tue, 04 Jun 2019 11:41:48 -0700

The above is flink 1.8

On Tue, Jun 4, 2019 at 12:32 PM Vishal Santoshi <vishal.santo...@gmail.com>
wrote:


> I had a sequence of events that created this issue.
>
> * I started a job and I had the state.checkpoints.num-retained: 5
>
> * As expected I have 5 latest checkpoints retained in my hdfs backend.
>
>
> * JM dies ( K8s limit etc ) without cleaning the hdfs directory.  The k8s
> job restores from the latest checkpoint ( I think ) but as it creates new
> checkpoints it does not delete the older chk point. At the end there are
> now 10 chkpoints,  5 from the old run which remain static and 5 latest
> representing the on going pipe.
>
> * The JM dies again and restart  from the latest from the 5 old
> checkpoints.
>
> This looks a bug in the Job Cluster implementation of flink. It looks like
> it is taking the 5th checkpoint from the beginning based on num-retained
> value, Note that it has the same job id and does not scope to a new
> directory.
>
>
> https://github.com/apache/flink/blob/1dfdaa417ab7cdca9bef1efe6381c7eb67022aaf/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L109
>
> Please tell me if this does not make sense.
>
> Vishal
>
>
>
>
>
>

Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

Reply via email to