Anyone?

On Tue, Jun 4, 2019, 2:41 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
> The above is Flink 1.8.
>
> On Tue, Jun 4, 2019 at 12:32 PM Vishal Santoshi <vishal.santo...@gmail.com>
> wrote:
>
>> I had a sequence of events that created this issue.
>>
>> * I started a job with state.checkpoints.num-retained: 5.
>>
>> * As expected, the 5 latest checkpoints are retained in my HDFS backend.
>>
>> * The JM dies (K8s limits, etc.) without cleaning the HDFS directory. The
>> K8s job restores from the latest checkpoint (I think), but as it creates
>> new checkpoints it does not delete the older ones. At the end there are
>> now 10 checkpoints: 5 from the old run, which remain static, and the 5
>> latest representing the ongoing pipeline.
>>
>> * The JM dies again and restores from the latest of the 5 old
>> checkpoints.
>>
>> This looks like a bug in the Job Cluster implementation of Flink. It
>> appears to take the 5th checkpoint from the beginning of the directory,
>> based on the num-retained value. Note that the restarted job has the same
>> job id and does not scope its checkpoints to a new directory.
>>
>> https://github.com/apache/flink/blob/1dfdaa417ab7cdca9bef1efe6381c7eb67022aaf/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L109
>>
>> Please tell me if this does not make sense.
>>
>> Vishal
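
To make the suspected interaction concrete, below is a minimal Java sketch. To be clear, this is not Flink's actual code: the class, the method names, and the "bounded" recovery rule are hypothetical stand-ins, written only to illustrate why checkpoints left behind by a killed JM could shadow the real latest one when a job with the same job id reuses the same unscoped directory.

    // A minimal, self-contained sketch. NOT Flink's implementation: all names
    // here are hypothetical, illustrating the interaction the thread suspects
    // between state.checkpoints.num-retained and checkpoint-store recovery.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class RetainedCheckpointStoreSketch {

        private final int numRetained;                        // e.g. 5
        private final List<Long> directory;                   // stands in for the HDFS checkpoint dir
        private final List<Long> tracked = new ArrayList<>(); // checkpoints THIS JM instance knows about

        RetainedCheckpointStoreSketch(int numRetained, List<Long> directory) {
            this.numRetained = numRetained;
            this.directory = directory;
        }

        // On completion: record the checkpoint, then subsume the oldest tracked ones.
        void addCheckpoint(long id) {
            directory.add(id);
            tracked.add(id);
            while (tracked.size() > numRetained) {
                long subsumed = tracked.remove(0);
                // Cleanup runs only while this JM is alive; a killed JM never gets
                // here, and a new JM tracks only its own checkpoints, so the old
                // run's files stay behind on disk.
                directory.remove(Long.valueOf(subsumed));
            }
        }

        // Recovery keyed on the newest checkpoint actually present: safe here.
        static long recoverLatest(List<Long> directory) {
            return Collections.max(directory);
        }

        // Recovery that assumes the directory holds at most numRetained entries
        // (the behavior the thread suspects): with two runs' worth of files it
        // lands on a stale checkpoint from the dead run.
        static long recoverAssumingBounded(List<Long> directory, int numRetained) {
            List<Long> sorted = new ArrayList<>(directory);
            Collections.sort(sorted);
            return sorted.get(numRetained - 1);               // "the 5th from the beginning"
        }

        public static void main(String[] args) {
            List<Long> hdfsDir = new ArrayList<>();

            // Run 1: checkpoints 1..5 complete, then the JM is killed (no cleanup).
            RetainedCheckpointStoreSketch run1 = new RetainedCheckpointStoreSketch(5, hdfsDir);
            for (long id = 1; id <= 5; id++) run1.addCheckpoint(id);

            // Run 2: same job id, same directory. It subsumes only its own
            // checkpoints, so run 1's five files remain static and the directory
            // ends up with 10 entries.
            RetainedCheckpointStoreSketch run2 = new RetainedCheckpointStoreSketch(5, hdfsDir);
            for (long id = 6; id <= 11; id++) run2.addCheckpoint(id);

            System.out.println("files on disk:      " + hdfsDir);                    // [1..5, 7..11]
            System.out.println("latest-id recovery: chk-" + recoverLatest(hdfsDir)); // chk-11
            System.out.println("bounded recovery:   chk-"
                    + recoverAssumingBounded(hdfsDir, 5));                            // chk-5 (stale)
        }
    }

Under these assumptions, recovery keyed on the highest checkpoint id returns chk-11, while the bounded variant returns chk-5 from the dead run, which matches the restore-from-stale-checkpoint symptom described above.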