Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

Vishal Santoshi Wed, 05 Jun 2019 06:29:54 -0700

Ok, I will do that.

On Wed, Jun 5, 2019, 8:25 AM Chesnay Schepler <ches...@apache.org> wrote:


> Can you provide us the jobmanager logs?
>
> After the first restart the JM should have started deleting older
> checkpoints as new ones were created.
> After the second restart the JM should have recovered all 10 checkpoints,
> started from the latest, and started pruning old ones as new ones were created.
>
> So you're running into 2 separate issues here, which is a bit odd.
>
> On 05/06/2019 13:44, Vishal Santoshi wrote:
>
> Anyone?
>
> On Tue, Jun 4, 2019, 2:41 PM Vishal Santoshi <vishal.santo...@gmail.com>
> wrote:
>
>> The above is on Flink 1.8.
>>
>> On Tue, Jun 4, 2019 at 12:32 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> I had a sequence of events that created this issue.
>>>
>>> * I started a job with state.checkpoints.num-retained: 5
>>>
>>> * As expected, the 5 latest checkpoints are retained in my HDFS backend.
>>>
>>>
>>> * The JM dies (K8s limit etc.) without cleaning the HDFS directory. The
>>> K8s job restores from the latest checkpoint (I think), but as it creates
>>> new checkpoints it does not delete the older checkpoints. At the end there
>>> are now 10 checkpoints: 5 from the old run, which remain static, and the
>>> 5 latest, representing the ongoing pipeline.
>>>
>>> * The JM dies again and restarts from the latest of the 5 old
>>> checkpoints.
>>>
>>> This looks like a bug in the Job Cluster implementation of Flink. It looks
>>> like it is taking the 5th checkpoint from the beginning, based on the
>>> num-retained value. Note that the job has the same job ID and does not
>>> scope to a new directory.
>>>
>>>
>>> https://github.com/apache/flink/blob/1dfdaa417ab7cdca9bef1efe6381c7eb67022aaf/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L109
>>>
>>> Please tell me if this does not make sense.
>>>
>>> Vishal
>>>
>>>
>>>
>>>
>>>
>>>
>
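
For reference, the behaviour Chesnay describes in the thread (recover every checkpoint that is found, restore from the latest one, and prune the oldest as new checkpoints complete) can be sketched as a small stand-alone simulation. This is only an illustration under stated assumptions: the class RetainedCheckpoints and its methods are hypothetical names invented here, checkpoint IDs are assumed to increase monotonically, and none of this is Flink's actual ZooKeeperCompletedCheckpointStore code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Hypothetical model of a bounded checkpoint store; NOT Flink's implementation.
public class RetainedCheckpoints {
    private final int numRetained;                        // mirrors state.checkpoints.num-retained
    private final Deque<Long> retained = new ArrayDeque<>(); // oldest first, newest last
    private final List<Long> deleted = new ArrayList<>();    // IDs pruned so far

    public RetainedCheckpoints(int numRetained) {
        if (numRetained < 1) {
            throw new IllegalArgumentException("numRetained must be >= 1");
        }
        this.numRetained = numRetained;
    }

    // Recovery: take every checkpoint ID that was found, ordered by ID.
    // Nothing is pruned here; pruning happens when new checkpoints complete.
    public void recover(List<Long> foundIds) {
        List<Long> sorted = new ArrayList<>(foundIds);
        Collections.sort(sorted);
        retained.clear();
        retained.addAll(sorted);
    }

    // The checkpoint to restore from: the highest ID, or null if none exist.
    public Long latest() {
        return retained.peekLast();
    }

    // A new checkpoint completed: add it, then drop the oldest ones
    // until at most numRetained remain.
    public void add(long checkpointId) {
        retained.addLast(checkpointId);
        while (retained.size() > numRetained) {
            deleted.add(retained.removeFirst());
        }
    }

    public List<Long> retained() {
        return new ArrayList<>(retained);
    }

    public List<Long> deleted() {
        return new ArrayList<>(deleted);
    }

    public static void main(String[] args) {
        // Second restart from the thread: 10 checkpoints are on disk, num-retained is 5.
        RetainedCheckpoints store = new RetainedCheckpoints(5);
        List<Long> found = new ArrayList<>();
        for (long id = 1; id <= 10; id++) {
            found.add(id);
        }
        store.recover(found);
        System.out.println("restore from: " + store.latest());
        store.add(11L);
        System.out.println("retained: " + store.retained());
        System.out.println("deleted: " + store.deleted());
    }
}
```

In this model, a restart that finds checkpoints 1 to 10 restores from 10, and once checkpoint 11 completes only 7 to 11 remain. The sequence reported in the thread differs on both points (old checkpoints were never pruned, and the restore picked one of the 5 old checkpoints), which matches the "2 separate issues" remark.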
